kuhumcst / dannet Goto Github PK

The Danish WordNet as an RDF graph.

License: MIT License

Clojure 93.89% Dockerfile 0.75% CSS 5.31% Shell 0.05%

graph triples triplets rdf wordnet apache-jena jena owl triplestore knowledge-graph danish linked-data semantic-web

dannet's Introduction

DanNet is a WordNet for the Danish language. The goal of this project is to represent DanNet in full using RDF as its native representation at both the database level, in the application space, and as its primary serialisation format.

Compatibility

Special care has been taken to maximise the compatibility of this iteration of DanNet. Like the DanNet of yore, the base dataset is published as both RDF (Turtle) and CSV. RDF is the native representation and can be loaded as-is inside a suitable RDF graph database, e.g. Apache Jena. The CSV files are now published along with column metadata as CSVW.

Companion datasets

Apart from the base DanNet dataset, several companion datasets exist expanding the graph with additional data. The companion datasets collectively provide a broader view of the data with both implicit and explicit links to other data:

The COR companion dataset links DanNet resources to IDs from the COR project.
The DDS companion dataset decorates DanNet resources with sentiment data.
The OEWN extension companion dataset provides DanNet-like labels for the Open English WordNet to better facilitate browsing the connections between the two datasets.

The current version of the datasets can be downloaded on wordnet.dk/dannet. All of the releases from 2023 and onwards are available as releases on this project page.

Inferred data

Additional data is also implicitly inferred from the base dataset, the aforementioned companion datasets, and any associated ontological metadata. These inferred data points can be browsed along with the rest of the data on the official DanNet website.

Inferring data so can be both computationally expensive and mentally taxing for the consumer of the data, so we do not always publish the fully inferred graph in a DanNet release; when we do, those release will be specifically marked as containing this extra data.

Standards-based

The old DanNet was modelled as tables inside a relational database. Two serialised representations also exist: RDF/XML 1.0 and a custom CSV format. The latter served as input for the new data model, remapping the relations described in these files onto a modern WordNet based on the Ontolex-lemon standard combined with various relations defined by the Global Wordnet Association as used in the official GWA RDF standard.

In Ontolex-lemon...

Synsets are analogous to ontolex:LexicalConcept.
Word senses are analogous to ontolex:LexicalSense.
Words are analogous to ontolex:LexicalEntry.
Forms are analogous to ontolex:Form.

By building DanNet according to these standards we maximise its ability to integrate with other lexical resources, in particular with other WordNets.

Significant changes

New schema, prefixes, URIs

DanNet uses a new schema, available in this repository and also at https://wordnet.dk/dannet/schema.

DanNet uses the following URI prefixes for the dataset instances, concepts (members of a dns:ontologicalType) and the schema itself:

dn -> https://wordnet.dk/dannet/data/
dnc -> https://wordnet.dk/dannet/concepts/
dns -> https://wordnet.dk/dannet/schema/

NOTE: these new prefixes/URIs take over from the ones used for DanNet 2.2 (the last version before the 2023 re-release):

dn -> http://www.wordnet.dk/owl/instance/2009/03/instances/

dn_schema -> http://www.wordnet.dk/owl/instance/2009/03/schema/

All the new URIs resolve to HTTP resources, which is to say that accessing a resource with a GET request (e.g. through a web browser) returns data for the resource (or schema) in question.

Finally, the new DanNet schema is written in accordance with the RDF conventions listed by Philippe Martin.

Implementation

The main database that the new tooling has been developed for is Apache Jena, which is a mature RDF triplestore that also supports OWL inferences. When represented inside Jena, the many relations of DanNet are turned into a queryable knowledge graph. The new DanNet is developed in the Clojure programming language (an alternative to Java on the JVM) which has multiple libraries for interacting with the Java-based Apache Jena, e.g. Aristotle and igraph-jena.

However, standardising on the basic RDF triple abstraction does open up a world of alternative data stores, query languages, and graph algorithms. See rationale.md for more.

Clojure support

In its native Clojure representation, DanNet can be queried in a variety of ways (described in queries.md). It is especially convenient to query data from within a Clojure REPL.

Support for Apache Jena transactions is built-in and enabled automatically when needed. This ensures support for persistence on disk through the TDB layer included with Apache Jena (mandatory for TDB 2). Both in-memory and persisted graphs can thus be queried using the same function calls. The DanNet website contains the complete dataset inside a TDB 2 graph.

Furthermore, DanNet query results are all decorated with support for the Clojure Navigable protocol. The entire RDF graph can therefore easily be navigated in tools such as Morse or Reveal from a single query result.

Web app

Note: A more detailed explanation is available at doc/web.md.

The frontend is written in ClojureScript. It is rendered using Rum and is served by Pedestal in the backend. If JavaScript is turned on, the initial HTML page becomes the entrypoint of a single-page app. If JavaScript is unavailable, this web app converts to a regular HTML website.

The URIs of each of the resources in DanNet resolve to actual HTML pages with content relating to the resource at the IRI. However, every DanNet resource has both an HTML representation and several other representations which can be accessed via HTTP content negotiation.

When JavaScript is disabled, usually only the HTML representation is used by the browser. However, when JavaScript is available, a frontend router (reitit) reroutes all navigation requests (e.g. clicking a hyperlink or submitting a form) towards fetching the application/transit+json representation instead. This data is used to refresh the Rum components, allowing them to update in place, while "fake" browser history item is inserted by reitit. The very same Rum components are also used to render the static HTML webpages.

Language negotiation is used to select the most suitable RDF data when multiple languages are available in the dataset.

Bootstrap

Initial bootstrap

The initial dataset was bootstrapped from the old DanNet 2.2 CSV files (technically: a slightly more recent, unpublished version) as well as several other input sources, e.g. the list of new adjectives produced by CST and DSL. This old CSV export mirrors the SQL tables of the old DanNet database.

Current releases

New releases of DanNet are now bootstrapped from the RDF export of the immediately preceding release.

In dk.cst.dannet.db.bootstrap the raw data from the previous version of DanNet is loaded into memory, cleaned up, and converted into triple data structures using the new RDF schema structure. These triples are imported into several Apache Jena graphs and the planned release changes to the dataset written in Clojure code are applied to these graphs. The union of these graphs is accessed through an InfGraph which also triggers inference of additional triples as defined in the associated OWL/RDFS schemas.

Finally, on the final run of this bootstrap process, the graph is exported into an RDF dataset. This dataset constitutes the new official version of DanNet. A smaller CSV dataset is also created, but this is not the full or canonical version of the data.

NOTE: the data used for bootstrapping should be located inside the ./boostrap subdirectory (relative to the execution directory).

Setup

The code is all written in Clojure and it must be compiled to Java Bytecode and run inside a Java Virtual Machine (JVM). The primary means to do this is Clojure's official CLI tools which can both fetch dependencies and build/run Clojure code. The project dependencies are specified in the deps.edn file.

While developing, I typically launch a new local DanNet web service using the restart function in dk.cst.dannet.web.service. This makes the service available at localhost:3456. The Apache Jena database will be spun up as part of this process.

The frontend must be run concurrently using shadow-cljs by running the following command in the terminal:

npx shadow-cljs watch app

Testing a release build

While developing, ideally you should be running code in a Clojure REPL.

However, when testing release you can either run the docker compose setup from inside the ./docker directory using the following command:

docker compose up --build

Usually, the Caddy container can keep running in between restarts, i.e. only the DanNet container should be rebuilt:

docker compose up -d dannet --build

NOTE: requires that the Docker daemon is installed and running!

Or you may build and run a new release manually from this directory:

shadow-cljs --aliases :frontend release app
clojure -T:build org.corfield.build/uber :lib dk.cst/dannet :main dk.cst.dannet.web.service :uber-file "\"dannet.jar\""
java -jar -Xmx4g dannet.jar

NOTE: requires that Java, Clojure, and shadow-cljs are all installed.

By default, the web service is accessed on localhost:3456. The data is loaded into a TDB2 database located in the ./db/tdb2 directory.

Regular operation of wordnet.dk/dannet

The system is registered as a systemd service which ensures smooth running between restarts:

cp system/dannet.service /etc/systemd/system/dannet.service
systemctl enable dannet
systemctl start dannet

This service merely delegates to the Docker daemon and attempts to ensure that both the Caddy reverse proxy and DanNet web service are available when the host OS is updated.

However, when doing a new release (NOTE: requires updating the database and various files on disk), it might be beneficial to shut down only the DanNet web service, not the Caddy reverse proxy, by using docker compose commands directly (see next section).

Making a release on wordnet.dk/dannet

The current release workflow assumes that the database and the export files are created on a development machine and the transferred to the production server. During the transfer, the DanNet web service will momentarily be down, so keep this in mind!

To build the database, load a Clojure REPL and load the dk.cst.dannet.web.service namespace. From here, execute (restart) to get a service up and running. When the service is up, go to the dk.cst.dannet.db namespace and execute either of the following:

;; A standard RDF, CSV & WN-LMF export
(export-rdf! @dk.cst.dannet.web.resources/db)
(export-csv! @dk.cst.dannet.web.resources/db)
(export-wn-lmf! "export/wn-lmf/")

;; The entire, realised dataset including inferences can also be written do disk.
;; Note: exporting the complete dataset (including inferences) usually takes ~40-45 minutes
(export-rdf! @dk.cst.dannet.web.resources/db :complete true)

Normally, the Caddy service can keep running, so only the DanNet service needs to be briefly stopped:

# from inside the docker/ directory on the production server
docker compose stop dannet

Once the service is down, the database and export files can be transferred using SFTP to the relevant directories on the server. The git commit on the production server should also match the uploaded data, of course!

After transferring the entire, zipped database as e.g. tdb2.zip, you may unzip it at the final destination using this command, which will overwrite the existing files:

unzip -o tdb2.zip -d /dannet/db/

The service is finally restarted with:

docker compose up -d dannet --build

When updating the database, you will likely also need to update the exported files. These are zip files which reside in either /dannet/export/csv or /dannet/export/rdf. I typically just move them to the server using Cyberduck and the run

mv cor.zip dannet.zip dds.zip oewn-extension.zip /dannet/export/rdf/
mv dannet-csv.zip /dannet/export/csv/
mv dannet-wn-lmf.xml.gz /dannet/export/wn-lmf/

Memory usage

Currently, the entire system, including the web service, uses ~1.4 GB when idle and ~3GB when rebuilding the Apache Jena database. A server should therefore have perhaps 4GB of available RAM to run the full version of DanNet.

Frontend dependencies

DanNet depends on React 17 since the React wrapper Rum depends on this version of React:

npm init -y
npm install react@17 react-dom@17 create-react-class@17

Querying DanNet

The easiest way to query DanNet currently is by compiling and running the Clojure code, then navigating to the dk.cst.dannet.db namespace in the Clojure REPL. From there, you can use a variety of query methods as described in queries.md.

For simple lemma searches, you can of course visit the official instance at wordnet.dk/dannet.

dannet's People

Contributors

Stargazers

Watchers

dannet's Issues

CSV published as CSVW?

See: https://csvw.org/

Apparently a product of Rick Moynihan and others. Could be an interesting way to marry RDF/CSV?

Attribute overviews for dataset resources

First of all, dataset IRIs need to be discovered. Various methods can be used, e.g. is the object of a :rdfs/isDefinedBy rel or is the subject of a :vann/preferredNamespaceUri or :vann/preferredNamespacePrefix rel.

When a resource can reliably be said to be a dataset, a special query can be used to attach all of the attributes defined in that namespace and list them, perhaps on a separate page.

Make entire setup configurable

Basically, the overall layout, colours, etc. should all be configurable using a Clojure map. Should probably be on a by-resource-class basis, allowing for more flexible pages.

Add metadata triples

Include usages in Jena db

The usages have been extracted and should probably be represented as

#{[wordsense :ontolex/usage '_usage]
  ['_usage :rdf/value usage-example]}

but they will need to be added to the Jena db using a suitable query that can find the right wordsense to attach them to, based on a Synset and a writtenRep.

Unfortunately, usage is not a well-specified part of the Ontolex standard.

Connotation

Once relation which doesn't seem to exist in the CSV files, but which does exist in the old DanNet RDF is the dns/connotation relation which can be e.g. "negative". I guess this is the sentiment data that Sussi is still working on. Rather than attempting to parse the old XML, it might be better to simply ignore these for now and include them at a later point, getting the data directly from Sussi.

Language negotiation interceptor

The default content negotiation interceptor of Pedestal doesn't do language negotiation. However, according to one user on Clojurians Slack, it is possible to copy and reuse most of the existing interceptor to also do language negotiation:

Fredrik 19:17
The argument :content-param-paths specifies where to look for the "Accept" header. If you set it to [:request :headers "accept-language"] , and change this line to store the negotiation result in the correct place, you should be good. https://github.com/pedestal/pedestal/blob/d20065013abf5d3793ae5301e18a2398707fa2a9/service/src/io/pedestal/http/content_negotiation.clj#L186
content_negotiation.clj
(assoc-in ctx [:request :accept] content-match)

So ideally, this interceptor is inserted ahead of all routes and the content-generating interceptors make use of this language information.

Fix "inserted by DanNet" senses

See:

Some senses are "inserted by DanNet" and thus have missing words and are missing proper labels. This should be remedied.

wrong owl:sameAs for m/k'er

see: http://localhost:3456/dannet/data/word-51001426

Database architecture

Currently, the Jena implementation can be persistent through TDB, but not when coupled with a reasoner doing inference. I am unsure whether if means that inference cannot realistically be coupled with TDB. Here are some options for database architecture:

A single, shared, fully in-memory graph (with inferencing)
- This would require a serious backup solution running every time the graph is updated.
An in-memory graph for viewing (with inferencing) + a TDB database for editing (NO inferencing)
- The editor can potentially be joined by another in-memory graph with inferencing to display results. Both TDB and in-memory graphs would then be manipulated.
- The viewer graph would live entirely separate, only having occasional refreshes.
A single, shared, TDB graph with inferencing (likely as an in-memory layer)
- Currently unsure whether this is possible.
- Would be the the ideal solution as it ensures both persistence and instant updates.
A single, shared TDB graph with NO inferencing + occassional updates with data from an OntologyGraph.
- This solution keeps the instant updates, but not of the inferred portions. These arrive at a later point once the graph is fully updated based on triples from an in-memory OntologyGraph.

Split ontologicalType

Rather than being represented as a single triples containing the current text literal (e.g. "BoundedEvent+Agentive+Mental+Purpose"), the ontological type(s) should be represented as a set of triples - one for each of the types contained in the text.

Searches for a prototypical collection of ontological types can still happen, but with this change the types can also be queried individually.

They seem to be derived from https://archive.illc.uva.nl/EuroWordNet/corebcs/ewnTopOntology.html although no OWL schema exists, so the relation and the concepts themselves will still have to be part of a DanNet OWL schema rather than an external one.

Research datastore alternatives

https://en.wikipedia.org/wiki/Comparison_of_triplestores

Datastores tested

Apache Jena (arachne)
Neo4j

Decorate HTML with RDFa

See: https://rdfa.info/

Linking to previous version of the schema

The new DanNet schemas should reference the older schemas for good measure.

ELEXIS Protocol for accessing dictionaries (1.1)

See: https://elexis-eu.github.io/elexis-rest/

Bolette suggested implementing this API, after hearing Taryn talk about this, that we do the same.

From what I can tell, much of the work is essentially about reworking the search endpoint and extending/proxying the existing content-negotiation solution so that Turtle and JSON-LD (at least) will be supported.

Dynamic prefix display

Currently, prefixes are hidden by default in the value part of the attribute-value table. The original intention was to let the user (eventually) hide/show the prefixes through some toggle.

However, now I think it might make the most sense to dynamically display prefixes in that column through a simple rule:

If the value column prefix matches the prefix of the current RDF resource -OR- if it matches the prefix of the attribute column, then it is hidden. This would essentially hide all relations that are part of the same dataset. The colour would be left to make their implicit relation more obvious.

In all other cases, the prefixes will be shown. This will make it clear that we are referencing e.g. cor:kickstarter in the owl:sameAs relation emanating from dn:kickstarter, while removing a tonne of superfluous ontolex: and dn: prefixes from view.

Persistence using Jena

Aristotle only wraps in-memory databases, but we will surely need to kind of persistence layer. Apache Jena's persistent triplestore is called TDB. There is also SDB which uses SQL databases as storage, but this has been deprecrated.

In order to use, TDB I will need to learn a bit more about the object-oriented architecture of Jena, specifically the relation between Graph vs Model vs Dataset classes.

Add COR-K IDs to DanNet graph

Since these IDs are not supposed to be a part of the real DanNet dataset, I should probably define them in a separate schema as well as exporting them separately.

fix RDF schema retrieval time out

Some of the w3c schema do not always properly resolve. They should probably be downloaded and cached locally.

Typical error:

[qtp454953625-30] INFO io.pedestal.http - {:msg "GET /dannet/data", :line 80}
[qtp454953625-30] INFO io.pedestal.http.cors - {:msg "cors request processing", :origin nil, :allowed true, :line 84}
[qtp454953625-30] ERROR io.pedestal.http.impl.servlet-interceptor - {:msg "error-ring-response triggered", :context {:response nil, :io.pedestal.interceptor.chain/stack (#Interceptor{:name :io.pedestal.http.impl.servlet-interceptor/stylobate} #Interceptor{:name :io.pedestal.http.impl.servlet-interceptor/terminator-injector}), :request {:protocol "HTTP/1.1", :async-supported? true, :remote-addr "[0:0:0:0:0:0:0:1]", :servlet-response #object[org.eclipse.jetty.server.Response 0x6222621b "HTTP/1.1 200 \nDate: Mon, 28 Feb 2022 16:44:05 GMT\r\n\r\n"], :servlet #object[io.pedestal.http.servlet.FnServlet 0x250e158e "io.pedestal.http.servlet.FnServlet@250e158e"], :headers {"sec-fetch-site" "none", "sec-ch-ua-mobile" "?0", "host" "localhost:8080", "user-agent" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36", "sec-fetch-user" "?1", "sec-ch-ua" "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"98\", \"Google Chrome\";v=\"98\"", "sec-ch-ua-platform" "\"macOS\"", "connection" "keep-alive", "upgrade-insecure-requests" "1", "accept" "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", "accept-language" "da,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6,zh;q=0.5", "sec-fetch-dest" "document", "accept-encoding" "gzip, deflate, br", "sec-fetch-mode" "navigate", "cache-control" "max-age=0"}, :server-port 8080, :servlet-request #object[org.eclipse.jetty.server.Request 0x6180c05a "Request(GET //localhost:8080/dannet/data)@6180c05a"], :path-info "/dannet/data", :url-for #object[clojure.lang.Delay 0x115d2083 {:status :pending, :val nil}], :uri "/dannet/data", :server-name "localhost", :query-string nil, :path-params {}, :body #object[org.eclipse.jetty.server.HttpInputOverHTTP 0x728a946a "HttpInputOverHTTP@728a946a[c=0,q=0,[0]=null,s=STREAM]"], :accept-language {:field "da", :type "da", :subtype nil}, :scheme :http, :request-method :get, :context-path "", :accept {:field "text/html", :type "text", :subtype "html"}}, :bindings {#'io.pedestal.http.route/*url-for* #object[clojure.lang.Delay 0x115d2083 {:status :pending, :val nil}]}, :enter-async [#object[io.pedestal.http.impl.servlet_interceptor$start_servlet_async 0x349029a "io.pedestal.http.impl.servlet_interceptor$start_servlet_async@349029a"]], :io.pedestal.interceptor.chain/terminators (#object[io.pedestal.http.impl.servlet_interceptor$terminator_inject$fn__15172 0x642c90b9 "io.pedestal.http.impl.servlet_interceptor$terminator_inject$fn__15172@642c90b9"]), :servlet-response #object[org.eclipse.jetty.server.Response 0x6222621b "HTTP/1.1 200 \nDate: Mon, 28 Feb 2022 16:44:05 GMT\r\n\r\n"], :route {:path "/dannet/data", :method :get, :path-re #"/\Qdannet\E/\Qdata\E", :path-parts ["dannet" "data"], :interceptors [#Interceptor{:name :io.pedestal.http.content-negotiation/negotiate-content} #Interceptor{:name :dk.cst.dannet.web.resources/negotiate-language} #Interceptor{:name :dk.cst.dannet.web.resources/entity}], :route-name :clojure.core/dn-dataset-entity, :path-params {}, :io.pedestal.http.route.prefix-tree/satisfies-constraints? #object[clojure.core$constantly$fn__5740 0x1c0ded87 "clojure.core$constantly$fn__5740@1c0ded87"]}, :servlet #object[io.pedestal.http.servlet.FnServlet 0x250e158e "io.pedestal.http.servlet.FnServlet@250e158e"], :servlet-request #object[org.eclipse.jetty.server.Request 0x6180c05a "Request(GET //localhost:8080/dannet/data)@6180c05a"], :url-for #object[clojure.lang.Delay 0x115d2083 {:status :pending, :val nil}], :io.pedestal.interceptor.chain/execution-id 1, :servlet-config #object[org.eclipse.jetty.servlet.ServletHolder$Config 0x3a8213a8 "org.eclipse.jetty.servlet.ServletHolder$Config@3a8213a8"], :async? #object[io.pedestal.http.impl.servlet_interceptor$servlet_async_QMARK_ 0x71949bb0 "io.pedestal.http.impl.servlet_interceptor$servlet_async_QMARK_@71949bb0"]}, :line 253}
clojure.lang.ExceptionInfo: java.util.concurrent.ExecutionException in Interceptor :dk.cst.dannet.web.resources/entity - org.apache.jena.atlas.web.HttpException: org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out {:execution-id 1, :stage :leave, :interceptor :dk.cst.dannet.web.resources/entity, :exception-type :java.util.concurrent.ExecutionException, :exception #error {
 :cause "Operation timed out"
 :via
 [{:type java.util.concurrent.ExecutionException
   :message "org.apache.jena.atlas.web.HttpException: org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out"
   :at [java.util.concurrent.FutureTask report "FutureTask.java" 122]}
  {:type org.apache.jena.atlas.web.HttpException
   :message "org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out"
   :at [org.apache.jena.riot.web.HttpOp exec "HttpOp.java" 1095]}
  {:type org.apache.http.conn.HttpHostConnectException
   :message "Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out"
   :at [org.apache.http.impl.conn.DefaultHttpClientConnectionOperator connect "DefaultHttpClientConnectionOperator.java" 156]}
  {:type java.net.ConnectException
   :message "Operation timed out"
   :at [sun.nio.ch.Net connect0 "Net.java" -2]}]
 :trace
 [[sun.nio.ch.Net connect0 "Net.java" -2]
  [sun.nio.ch.Net connect "Net.java" 579]
  [sun.nio.ch.Net connect "Net.java" 568]
  [sun.nio.ch.NioSocketImpl connect "NioSocketImpl.java" 588]
  [java.net.SocksSocketImpl connect "SocksSocketImpl.java" 327]
  [java.net.Socket connect "Socket.java" 633]
  [org.apache.http.conn.socket.PlainConnectionSocketFactory connectSocket "PlainConnectionSocketFactory.java" 75]
  [org.apache.http.impl.conn.DefaultHttpClientConnectionOperator connect "DefaultHttpClientConnectionOperator.java" 142]
  [org.apache.http.impl.conn.PoolingHttpClientConnectionManager connect "PoolingHttpClientConnectionManager.java" 376]
  [org.apache.http.impl.execchain.MainClientExec establishRoute "MainClientExec.java" 393]
  [org.apache.http.impl.execchain.MainClientExec execute "MainClientExec.java" 236]
  [org.apache.http.impl.execchain.ProtocolExec execute "ProtocolExec.java" 186]
  [org.apache.http.impl.execchain.RetryExec execute "RetryExec.java" 89]
  [org.apache.http.impl.execchain.RedirectExec execute "RedirectExec.java" 110]
  [org.apache.http.impl.client.InternalHttpClient doExecute "InternalHttpClient.java" 185]
  [org.apache.http.impl.client.CloseableHttpClient execute "CloseableHttpClient.java" 83]
  [org.apache.http.impl.client.CloseableHttpClient execute "CloseableHttpClient.java" 56]
  [org.apache.jena.riot.web.HttpOp exec "HttpOp.java" 1082]
  [org.apache.jena.riot.web.HttpOp execHttpGet "HttpOp.java" 322]
  [org.apache.jena.riot.web.HttpOp execHttpGet "HttpOp.java" 381]
  [org.apache.jena.riot.RDFParser openTypedInputStream "RDFParser.java" 389]
  [org.apache.jena.riot.RDFParser parseURI "RDFParser.java" 302]
  [org.apache.jena.riot.RDFParser parse "RDFParser.java" 296]
  [org.apache.jena.riot.RDFParserBuilder parse "RDFParserBuilder.java" 540]
  [org.apache.jena.riot.RDFDataMgr parseFromURI "RDFDataMgr.java" 921]
  [org.apache.jena.riot.RDFDataMgr read "RDFDataMgr.java" 252]
  [org.apache.jena.riot.RDFDataMgr read "RDFDataMgr.java" 221]
  [org.apache.jena.riot.RDFDataMgr read "RDFDataMgr.java" 151]
  [org.apache.jena.riot.RDFDataMgr read "RDFDataMgr.java" 142]
  [org.apache.jena.riot.adapters.RDFReaderRIOT read "RDFReaderRIOT.java" 76]
  [org.apache.jena.rdf.model.impl.ModelCom read "ModelCom.java" 259]
  [jdk.internal.reflect.NativeMethodAccessorImpl invoke0 "NativeMethodAccessorImpl.java" -2]
  [jdk.internal.reflect.NativeMethodAccessorImpl invoke "NativeMethodAccessorImpl.java" 77]
  [jdk.internal.reflect.DelegatingMethodAccessorImpl invoke "DelegatingMethodAccessorImpl.java" 43]
  [java.lang.reflect.Method invoke "Method.java" 568]
  [clojure.lang.Reflector invokeMatchingMethod "Reflector.java" 167]
  [clojure.lang.Reflector invokeInstanceMethod "Reflector.java" 102]
  [dk.cst.dannet.db$__GT_schema_model$fn__27300 invoke "db.clj" 46]
  [clojure.core.protocols$fn__8249 invokeStatic "protocols.clj" 168]
  [clojure.core.protocols$fn__8249 invoke "protocols.clj" 124]
  [clojure.core.protocols$fn__8204$G__8199__8213 invoke "protocols.clj" 19]
  [clojure.core.protocols$seq_reduce invokeStatic "protocols.clj" 31]
  [clojure.core.protocols$fn__8236 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__8236 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__8178$G__8173__8191 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6886]
  [clojure.core$reduce invoke "core.clj" 6868]
  [dk.cst.dannet.db$__GT_schema_model invokeStatic "db.clj" 43]
  [dk.cst.dannet.db$__GT_schema_model invoke "db.clj" 40]
  [dk.cst.dannet.db$__GT_dannet invokeStatic "db.clj" 142]
  [dk.cst.dannet.db$__GT_dannet doInvoke "db.clj" 115]
  [clojure.lang.RestFn invoke "RestFn.java" 457]
  [dk.cst.dannet.web.resources$eval27472$fn__27473 invoke "resources.clj" 32]
  [clojure.core$binding_conveyor_fn$fn__5823 invoke "core.clj" 2047]
  [clojure.lang.AFn call "AFn.java" 18]
  [java.util.concurrent.FutureTask run "FutureTask.java" 264]
  [java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1136]
  [java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 635]
  [java.lang.Thread run "Thread.java" 833]]}}
	at io.pedestal.interceptor.chain$throwable__GT_ex_info.invokeStatic(chain.clj:35)
	at io.pedestal.interceptor.chain$throwable__GT_ex_info.invoke(chain.clj:32)
	at io.pedestal.interceptor.chain$try_f.invokeStatic(chain.clj:57)
	at io.pedestal.interceptor.chain$try_f.invoke(chain.clj:44)
	at io.pedestal.interceptor.chain$leave_all_with_binding.invokeStatic(chain.clj:254)
	at io.pedestal.interceptor.chain$leave_all_with_binding.invoke(chain.clj:237)
	at io.pedestal.interceptor.chain$leave_all$fn__11219.invoke(chain.clj:268)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at io.pedestal.interceptor.chain$leave_all.invokeStatic(chain.clj:266)
	at io.pedestal.interceptor.chain$leave_all.invoke(chain.clj:260)
	at io.pedestal.interceptor.chain$execute.invokeStatic(chain.clj:379)
	at io.pedestal.interceptor.chain$execute.invoke(chain.clj:352)
	at io.pedestal.interceptor.chain$execute.invokeStatic(chain.clj:389)
	at io.pedestal.interceptor.chain$execute.invoke(chain.clj:352)
	at io.pedestal.http.impl.servlet_interceptor$interceptor_service_fn$fn__15197.invoke(servlet_interceptor.clj:351)
	at io.pedestal.http.servlet.FnServlet.service(servlet.clj:28)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:550)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.util.concurrent.ExecutionException: org.apache.jena.atlas.web.HttpException: org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at clojure.core$deref_future.invokeStatic(core.clj:2317)
	at clojure.core$future_call$reify__8544.deref(core.clj:7041)
	at clojure.core$deref.invokeStatic(core.clj:2337)
	at clojure.core$deref.invoke(core.clj:2323)
	at dk.cst.dannet.web.resources$__GT_entity_ic$fn__27528.invoke(resources.clj:151)
	at io.pedestal.interceptor.chain$try_f.invokeStatic(chain.clj:54)
	... 39 more
Caused by: org.apache.jena.atlas.web.HttpException: org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out
	at org.apache.jena.riot.web.HttpOp.exec(HttpOp.java:1095)
	at org.apache.jena.riot.web.HttpOp.execHttpGet(HttpOp.java:322)
	at org.apache.jena.riot.web.HttpOp.execHttpGet(HttpOp.java:381)
	at org.apache.jena.riot.RDFParser.openTypedInputStream(RDFParser.java:389)
	at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:302)
	at org.apache.jena.riot.RDFParser.parse(RDFParser.java:296)
	at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
	at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:921)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:252)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:221)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:151)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:142)
	at org.apache.jena.riot.adapters.RDFReaderRIOT.read(RDFReaderRIOT.java:76)
	at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:259)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:167)
	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:102)
	at dk.cst.dannet.db$__GT_schema_model$fn__27300.invoke(db.clj:46)
	at clojure.core.protocols$fn__8249.invokeStatic(protocols.clj:168)
	at clojure.core.protocols$fn__8249.invoke(protocols.clj:124)
	at clojure.core.protocols$fn__8204$G__8199__8213.invoke(protocols.clj:19)
	at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31)
	at clojure.core.protocols$fn__8236.invokeStatic(protocols.clj:75)
	at clojure.core.protocols$fn__8236.invoke(protocols.clj:75)
	at clojure.core.protocols$fn__8178$G__8173__8191.invoke(protocols.clj:13)
	at clojure.core$reduce.invokeStatic(core.clj:6886)
	at clojure.core$reduce.invoke(core.clj:6868)
	at dk.cst.dannet.db$__GT_schema_model.invokeStatic(db.clj:43)
	at dk.cst.dannet.db$__GT_schema_model.invoke(db.clj:40)
	at dk.cst.dannet.db$__GT_dannet.invokeStatic(db.clj:142)
	at dk.cst.dannet.db$__GT_dannet.doInvoke(db.clj:115)
	at clojure.lang.RestFn.invoke(RestFn.java:457)
	at dk.cst.dannet.web.resources$eval27472$fn__27473.invoke(resources.clj:32)
	at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2047)
	at clojure.lang.AFn.call(AFn.java:18)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to www.w3.org:80 [www.w3.org/128.30.52.100] failed: Operation timed out
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at org.apache.jena.riot.web.HttpOp.exec(HttpOp.java:1082)
	... 41 more
Caused by: java.net.ConnectException: Operation timed out
	at java.base/sun.nio.ch.Net.connect0(Native Method)
	at java.base/sun.nio.ch.Net.connect(Net.java:579)
	at java.base/sun.nio.ch.Net.connect(Net.java:568)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	... 51 more
[qtp454953625-30] INFO io.pedestal.http.impl.servlet-interceptor - {:msg "sending error", :message "Internal server error: exception", :line 215}

Visual graph element

Some kind of React graph element, presumably.

Should probably be a special element that can be opened or a separate page that is navigated to.

Requires moving the frontend dependencies from deps.edn into shadow-cljs.edn since I will need to use the NPM library integration of shadow-cljs.

Fix Trie + implement UI for search suggestions

Perhaps the easy fix is to separate frontend code into a shadow-cljs.edn file, although I am unsure whether this will be possible since shadow-cljs does depend on some backend libs too.

Other resources:

https://modulolotus.net/posts/2018-05-31-clojure-trie-performance/

Duplicate synsets

{ bevæge§4 sig }
- http://localhost:3456/dannet/data/synset-31753
- http://localhost:3456/dannet/data/synset-77983
- seems to be a case of #60.
{ advarselsskilt§1 • advarselstavle§2 }
- http://localhost:3456/dannet/data/synset-41779
- http://localhost:3456/dannet/data/synset-58942
- seems to be signs on a road vs. not a road, but difficult to tell.
{ Jomfru§4 Marias sengehalm }
- http://localhost:3456/dannet/data/synset-558
- http://localhost:3456/dannet/data/synset-774
- seems to be an actual duplicate.
{ amarant§3 }
- http://localhost:3456/dannet/data/synset-568
- http://localhost:3456/dannet/data/synset-645

... plus many more examples found under the hyponyms of http://localhost:3456/dannet/data/synset-559

These synsets contain the same single sense, so the fact that they exist must either mean that the sense is wrong or that one of them should be deleted.

Attr-val table value appearance informed by relation name

The functions rendering the value for that row need to have the key passed down to the value rendering function, so that it can be dispatched on to modify the appearance of the value. An example might be :vann/preferredNamespacePrefix which could turn the value into a (prefix-elem ...) call.

Implement frontend using Rum

Depends on #21. Attempt to mostly reuse the same component in frontend and backend code.

The long lists of resources will not be a part of the initial HTML page, instead loading them on-demand instead when the user clicks the details element.
This will enable CSS transitions making the web page a lot more fluid, e.g. shifting background colours.

Infer `ontolex:partOfSpeech` from `wn:partOfSpeech`

Work on version 1.2 of the GWA schema is ongoing. See: globalwordnet/schemas#58 (comment)

It should be possible to infer the Ontolex triple from the GWA relation, but it will require rewriting - or perhaps uncommenting the correct section of - the Apache Jena rule DSL. Once that is done, the bootstrap code should only include the wn:partOfSpeech triple, letting Jena's inference take care of the rest.

Merge duplicate words

The DanNet dataset has at ~401 words with multiple instances:

(->> (q/run (:graph @db) op/word-clones)
     (filter (fn [{:syms [?w1 ?w2]}]
               (and (= "dn" (namespace ?w1))
                    (= "dn" (namespace ?w2)))))
     (group-by (juxt '?writtenRep '?pos))
     (count))

These words exist in the old dataset files with multiple IDs and thus result in multiple resources when creating the new DanNet graph. In the new DanNet dataset they should all be merged as part of the import procedure. Triples that reference the obsolete IDs must be modified too.

Krybception

Too many kryb:

Nathalie found this issue. Not sure whether it is something I will fix or whether it will come in as part of the overall changeds the lexicographers are deciding on.

Map DanNet->Ontolex-Lemon

Convert Princeton mappings

The mapping between the old DanNet and Princeton WordNet are quite of messy. Rather than mapping directly to the instances of one WordNet dataset, they map to two different kinds of sense IDs using the dns:eq_has_synonym relation:

"ENG20-02324230-n"
- this seems to be mapping to wn:synset-sheep-noun-1 in WordNet 2.0 as it has the synsetID 102324230.
"sheep%1:05:00::"
- this seems to have already been covered by the above...? So why does it exist?

It may be beneficial to completely ignore these relations for now and revisit them once Sussi is done mapping relations, at which point they can be properly converted and included in the dataset.

Mark inherited relations in UI

This information should be sniffed out of the RDFS comments and the appropriate rows marked as inherited. Perhaps the comments could be entirely removed in the UI? Food for thought.

Translations in Turtle format

The new DanNet schemas should come with labels/comments in both Danish and English.

In addition, the labels of the external schemas can also be translated into Danish (and published as a companion dataset). This would, as a side-effect, create a mostly Danish HTML experience.

Make full resource ID a hyperlink

e.g. http://localhost:3456/dannet/external?subject=%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E

We want the < http://purl.org/dc/elements/1.1/ > to be a clickable hyperlink so that it is easier for the user to visit this resource in their browser.

Alternative label language

For the individual labels, it makes sense that they are converted into select elements in cases where multiple languages exist and that the selected language becomes the default as opposed to all of the language info dissappearing from the UI entirely. For example, in some cases it might be nice to be able to see what the English label was as opposed to the Danish. This would be a relatively intuitive, hyperlocal way to inspect this data.

Implementing it entails creating a new component and passing down additional information in many places, e.g. languages and complete result sets.

Canonical lemmas for synsets

In some synsets, there are too many words to make display of them practical
—e.g. { bold1§3 • byld§3 • bær§3 • bøtte§2 • hoved§1 • kasse1§6 • knold§5 • knop§2c • nød2§3 • roe§2 • skal§6
}—when displaying in certain situations, for instance when displaying a graph visualisation or as a way to adjust what words are display directly in the UI, barring user interaction explicitly expanding the list.

By using the ordnet.dk indices as a heuristic (1§3, §2, 1§6, §2c, ...) it should be possible to limit to the words the most canonical representations. Basically, the lower the number, the better, so in the above case the canonical word(s) would be hoved as it doesn't share this low index with any other words in the synset.

Princeton Wordnet ID conversions

In the old RDF export, several thousand references to both DanNet and Princeton Wordnet IDs use a format that is not recognised by Apache Jena. In more concrete terms, attempting to import DanNet as-is results in many thousands of lines of this style of warning:

...
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9618, col: 114] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-beard%1:08:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9619, col: 82] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-salon%1:06:02::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9620, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-salon%1:06:02::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9621, col: 94] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-supermarket%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9622, col: 120] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-supermarket%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9623, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-hall%1:06:04::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9624, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-hall%1:06:04::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9625, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-home%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9626, col: 112] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-home%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9627, col: 90] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-residence%1:15:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9628, col: 117] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-residence%1:15:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9629, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-flat%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9630, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-flat%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9631, col: 80] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-tent%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9632, col: 113] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-tent%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
[nREPL-session-10db3318-811e-47e9-beec-95f4f004cc0b] WARN org.apache.jena.riot - [line: 9633, col: 84] {W107} Bad URI: <http://www.wordnet.dk/owl/instance/2009/03/instances/synset-stable%1:06:00::> Code: 30/ILLEGAL_PERCENT_ENCODING in PATH: The host component a percent occurred without two following hexadecimal digits.
...

The issue has to do with the suffixes such as %1:06:02:: or %1:15:00:: not being valid according to the XML processor used by Jena.

According to Nicolai Hartvig Sørensen (the old maintainer of DanNet) the IDs have apparently undergone several changes since then and will have to be changed anyway. In addition, they might have been mangled in the current export, e.g. %5:00:00:rich:03 is also an example suffix.

Nicolai says the current IDs are based on this paper: https://www.aclweb.org/anthology/W11-0129.pdf

We need to come up with some type of scheme to reliably convert these old IDs to the newer IDs used by Princeton - or the Open English Wordnet in case we use that.

Some further complications:

The COR project is also producing a new set of IDs and we are nominally required to accept them in some form.
There is also a requirement by the team at DSL (the old DanNet maintainers) to maintain the connection to the data at DSL which is faciliated by the IDs present in the current version of DanNet.

RDF export

Currently, the system is bootstrapped from various existing data sources, but there is no official export.

In the exported version, the different input datasets will need to be separated entirely, necessitating a separation into separate graphs or some other means of data separation at the point of export.

Jena (SPARQL in general) has the concept of named graphs, however the way this is implemented in Jena seems to be basically as a union of separate graph objects. One major complication is the fact that the web frontend currently relies on inference to generate "missing" triples and this is done for a single "data" graph + a "schema" graph, so having multiple graphs will likely not work.

Map register to wordsenses

See: https://www.w3.org/2016/05/ontolex/#lexical-sense-reference (e.g. lexinfo:dating, dct:subject)

The register is included as part of the values in the wordsenses.csv data. This metadata can also be translated and translated into triples.

Create new DanNet ontology

The few DanNet relations that could not be represented using the GWA relations will need to be represented inside a new OWL file with a fitting namespaces. This file can probably be generated using one of these Clojure OWL libraries:

If that is not ideal, then writing them by hand is also an option.

Furthermore, Ontolex OWL (and other) files will also need to be referenced inside Apache Jena instance. I will also likely have so set up some kind of Inference for opposite relations.

Deal with blank nodes

Some inspiration here: https://www.reddit.com/r/semanticweb/comments/qke4gu/til_how_to_roundtrip_blank_nodes_in_jenafuseki/

Some RDF nodes are blank—i.e. they don't have a stable ID—and dealing with them can be troublesome. It may be possible to find a generic way to "expand" blank node content and include it with RDF resource lookups.

API design

Presumably, this will be primarily transit endpoints that pull RDF resources based on an id, but perhaps with some special queries available too.

It should have some kind of content negotiation, such that it will return transit, JSON, or maybe even HTML when applicable. The IDs should match the RDF resource IDs and the main public site should be the namespace of the new DanNet.

The API will be used by a CLJS app available at the same domain that can be used to browse and visualise the WordNet.

Transito can be used to create the communication channel. Pedestal is probably the obvious backend framework.

Entity sections dispatched on :rdf/type

Basically, there should be a :default set of sections for most entities, but for certain ones there is an alternate set of sections dispatched on the entity's :rdf/type. Multimethods are probably the way to go? Perhaps in another namespace.

Unsupported value types

See: http://localhost:3456/dannet/external?subject=%3Chttp%3A%2F%2Fwww.ontologydesignpatterns.org%2Fcp%2Fowl%2Fsemiotics.owl%3E

Annotate belonging

The above were suggested by Eric Scott (ont-app).

I would like to annotate every property, class, resources, so that it is clear to which entity it belongs. This can be used both for export purposes—although named graphs might be a better fit—as well as for browsing data and metadata.

Relation-dependent label descriptions

(this issue extends #13 and is essentially a nice-to-have)

In the current iteration of the new DanNet web presence there is an unsolved issue with certain labels being identical in result sets, making resources hard to differentiate. One example might be centrum which evokes multiple synsets and has multiple senses all with the same label, making it impossible to distinguish the list of resources.

One solution could be to have a way of obtaining and displaying additional information depending on the relation in question (or alternatively via the type, although this requires class information to be fetched in the original query). This could be a graph path, set of triples, a function, etc. which defines how to get the additional data point, e.g. for the ontolex:evokes relation the graph path would be

[:skos/definition]

while for ontolex:lexicalizedSense the graph path would be

[:lexinfo/senseExample :rdf/value]

Thus, for words the definitions and example sentences are available for decorating the listed resources. The decoration might be represented as a sublist underneath the item as this would also allow for the support of multiple returned items for the given path, e.g. multiple example sentences.

DanNet resources on the web

The DanNet-specific resources/entities should be resolvable via their URI.

Navigating to the URI should result in some form of content negotiation and the return of either a data structure (JSON, EDEN, TTL) or perhaps a basic HTML web page. The data comes directly from an instance of a TDB graph. This is in line with recommendations for linked data.

The new DanNet OWL schema should also resolve. This is a more specific case (effectively just serving static data).

OWL inferencing for Ontolex + WordNet schemas

Extracted from issue #7:

Furthermore, Ontolex OWL (and other) files will also need to be referenced inside Apache Jena instance. I will also likely have so set up some kind of Inference for opposite relations.

Automatic upgrade of dead link relations

See e.g. rdfs:isDefinedBy in http://localhost:3456/dannet/external/svs/term_status

The linked resource ends in # and this doesn't resolve to anything. However, removing the # and retrying the fetch does. One could argue that a URI with an empty # (= empty fragment string) is in fact the same as one without a # at all, so upgrading to the resource without the # at the query level should be entirely legal. However, for the odd case where the #-appended resource identifier actually resolves to something we want to return that entity instead.

It should be implemented as a basic dual search, but only in cases where the resource identifier matches this pattern and nothing is found on the first attempt.

Inference performance improvements

No matter what is decided in #10, inferencing will be a key component of the new DanNet in some way. Obviously, if the end goal is a single shared database with inferencing, performance will be extremely important, but it also is in many other cases.

Currently, the lowest level of OWL inferencing - OWL Micro - is used (specifically: OntModelSpec/OWL_MEM_MICRO_RULE_INF). This still incurs a heavy performance penalty which results in 10+ minute long queries in some cases... with subsequent queries being only a few seconds, though.

Another issue is that the inferred information is not really all that valuable, e.g. Owl:thing or some of the transitive SKOS relations, e.g.skos:narrowerTransitivewhich returns hundreds of results, and also many useless anonymous resources included in e.g. :rdf/type, presumably also through some transitive relation.

So it is in the interest of the inference model itself to make it more performant as this will also lead to much less cluttered results.

Possible Solutions

I will probably first experiment with a few of the less comprehensive built-in inference models.
If that doesn't work out I will have to build my own custom inference layer which should be possible. It is fortunately relatively well-documented.
In either case, the most important relation to infer is the opposite relations of all of the triples that go unspecified.

Migrate Hiccup to lambdaisland/hiccup

See: https://github.com/lambdaisland/hiccup

This is basically Reagent Hiccup, but on the server side. I intend this to be a first step towards turning the DanNet website into a progressive web app. Once loaded in the frontend, reagent takes over rendering.

Marl update

Apparently, the version of Marl I have cached, version 0.2, is significantly older than the current version, version 1.2. This is because the version I have cached comes from the PURL which was last updated in 2011. It seems that—although the PURL version is probably more widespread—it makes sense to use the current version from 2021 instead.

Since this newer version also has a more thorough documentation, it makes sense to see if the current representation in DanNet can be improved.