swirrl / datahost-prototypes

License: Eclipse Public License 1.0

Clojure 77.04% HCL 4.77% Shell 0.33% Python 1.66% JavaScript 1.11% HTML 14.21% CSS 0.85% Dockerfile 0.04%

datahost-prototypes's Issues

Demonstrate access to arbitrary RDF predicates

In a previous WIP prototype, cubiql-2, we showed how we can support some of the extensibility of RDF schemas within a relatively fixed GraphQL schema by including specific property path fields at key points in the schema.

For example cubiql-2 supported this query:

{
  endpoint {
    datacube(id:"http://covid/infections") {
      id
      observations {
        area_label: property_path_string(path:["sdmxd:refArea" "rdfs:label"])
      }
    }
  }
}

The property_path_XXXX fields encoded the expected return type in their names and would cast accordingly. We included the following set of fields:

  • property_path_literal_query
  • property_path_uri_query
  • property_path_int_query
  • property_path_float_query
  • property_path_string_query
  • property_path_lang_string_query
  • property_path_datetime

Because these fields take an array, it is possible to enforce depth limits on the length of the path without changing the schema.

The property path fields in cubiql-2 worked with the default prefixes, as well as an extended set of prefixes that could be provided in a parent GraphQL resolver.

It would be nice to selectively demonstrate this in this prototype, in particular on the Global Object Identification interface.

There will always be performance costs associated with this, so it is important to control the contexts in which this feature is available and can be optimised, along with the depth of the paths supported in each context.
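
As a rough illustration (not the cubiql-2 implementation), converting the path argument into a SPARQL property path with such a depth check might look like the sketch below; max-path-depth and path->sparql are hypothetical names:

(require '[clojure.string :as str])

(def max-path-depth 3) ; hypothetical per-context limit

(defn path->sparql
  "Joins prefixed predicates like [\"sdmxd:refArea\" \"rdfs:label\"]
  into a SPARQL property path such as \"sdmxd:refArea/rdfs:label\"."
  [path]
  (when (> (count path) max-path-depth)
    (throw (ex-info "Property path too deep" {:path path})))
  (str/join "/" path))

;; (path->sparql ["sdmxd:refArea" "rdfs:label"]) ;; => "sdmxd:refArea/rdfs:label"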

Improve READMEs

We need to communicate the intention of the prototypes, so we should improve the READMEs with:

  1. The rationale for the prototypes
  2. Some examples of what they support
  3. How to use them programmatically / access their docs
  4. Description of how to get the builds
  5. A guide to the API and data model

Support client side js users by setting CORS headers

Currently you get errors like:

Access to fetch at 'http://localhost:8888/api' from origin 'http://localhost:8080/' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

NOTE: to fully support client-side JS API access we will also need to fix #32 and move to HTTPS, to prevent mixed-content errors.
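
A minimal sketch of one way to set the headers in a ring stack, assuming the ring-cors middleware; wrap-api-cors and the open origin pattern are illustrative and would be tightened for production:

(ns tpximpact.example.cors
  (:require [ring.middleware.cors :refer [wrap-cors]]))

(defn wrap-api-cors [handler]
  (wrap-cors handler
             ;; allow any origin while prototyping; restrict this later
             :access-control-allow-origin [#".*"]
             :access-control-allow-methods [:get :post :put :delete :options]))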

Create a stable set of fixture data for tests

The prototype currently works off the live database, which is continually being changed.

We should extract a modest set of fixture data from the beta site by running some CONSTRUCT queries, and then load those fixtures into an in-memory (or RDF4j native store) SPARQL repo.
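
A minimal sketch of loading such fixtures into an in-memory repo, assuming grafter's RDF4j helpers; the fixture path is hypothetical:

(ns tpximpact.example.fixtures
  (:require [grafter-2.rdf4j.repository :as repo]
            [grafter-2.rdf4j.io :as gio]
            [grafter-2.rdf.protocols :as pr]))

(defn fixture-repo
  "Returns an in-memory RDF4j repo loaded with the extracted fixture data."
  []
  (let [r (repo/sail-repo)]
    (with-open [conn (repo/->connection r)]
      ;; hypothetical path to data produced by the CONSTRUCT queries
      (pr/add conn (gio/statements "test/fixtures/catalog.ttl")))
    r))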

Make catql-prototype repo a mono repo containing a new datahost-ld-api service

The point here is to turn the catql-prototype repo into a mono repo where we have CI and docker setup for two app builds (the graphql prototype and the new linked data API prototype).

  1. Move the catql service into a sub directory of its repo named datahost-graphql as the first step in turning this repo into a monorepo (containing two prototype services, not one)
  2. Ensure it still builds and deploys (i.e. fix CI to run the tests/build for the service which was moved from the root of the repo down to datahost-graphql)
  3. Copy the datahost-graphql project into the datahost-ld-openapi (so we get the same project structure)
  4. Rename appropriately so it tests and deploys docker containers to datahost-ld-openapi (though at this point the implementation will be a copy of the graphql service)
  5. Gut the implementation and rename the main namespace to tpximpact.datahost.ldapi
  6. Replace the pedestal service with a simple ring-based hello world handler (see the sketch after this list).
  7. Replace the tests with one asserting it returns 200 OK "Hello World"
  8. Deploy to a Google docker optimised instance
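
For steps 6 and 7, a minimal sketch of the ring hello world handler (namespace and port are illustrative):

(ns tpximpact.datahost.ldapi.handler-sketch
  (:require [ring.adapter.jetty :as jetty]))

(defn handler [_request]
  {:status 200
   :headers {"Content-Type" "text/plain"}
   :body "Hello World"})

(defn -main [& _args]
  ;; :join? false keeps the REPL usable while the server runs
  (jetty/run-jetty handler {:port 3000 :join? false}))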

New routes

Add routes for (a reitit sketch follows the list):

  • PUT /data/:series/:release

  • GET /data/:series/:release

  • POST /data/:series/:release/revision

  • ...

  • POST /data/:series/:release/schema
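
A minimal reitit sketch of these routes, with stub handlers standing in for the real implementations:

(ns tpximpact.datahost.ldapi.routes-sketch
  (:require [reitit.ring :as ring]))

(defn- ok [msg] (fn [_req] {:status 200 :body msg}))

(def app
  (ring/ring-handler
   (ring/router
    [["/data/:series/:release"          {:put (ok "put release")
                                         :get (ok "get release")}]
     ["/data/:series/:release/revision" {:post (ok "post revision")}]
     ["/data/:series/:release/schema"   {:post (ok "post schema")}]])
   (ring/create-default-handler)))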

Inverse triples

Due to the document-centric nature of some of the RDF here, it would be nice to find a way to augment some of those documents with the inverse triples of key predicates, e.g.

  my-release dcat:inSeries my-series => my-series dcat:seriesMember my-release

This could in principle be done either on write or on read.
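
A rough sketch of the augmentation, using a simple table of predicates and their inverses (names are illustrative):

(def inverse-predicates
  {"dcat:inSeries" "dcat:seriesMember"})

(defn inverse-triples
  "Given triples as [s p o] vectors, returns the inverse triples for any
  predicate listed in inverse-predicates."
  [triples]
  (for [[s p o] triples
        :let [inv (inverse-predicates p)]
        :when inv]
    [o inv s]))

;; (inverse-triples [["my-release" "dcat:inSeries" "my-series"]])
;; => (["my-series" "dcat:seriesMember" "my-release"])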

Change schema: Change endpoint_id to draftset_id

Currently endpoint_id can be any SPARQL endpoint. We should rename this parameter to draftset_id to prevent any assumptions that it works on arbitrary third-party endpoints. draftset_id will then, at some point in the future, be a string indicating the draftset you are querying against.

If this parameter isn't provided we should default to the public endpoint.

GraphQL facet queries that don't return the `datasets` field appear to hang forever

e.g. this works:

{
  endpoint {
    catalog {
      id
      catalog_query {
        datasets {
          id
        }
        facets {
          publishers {
            id
            label
          }
          creators {
            id
            label
          }
          themes {
            id
            label
          }
        }
      }
    }
  }
}

But removing the datasets sub-query, giving the query below, causes it to hang. I strongly suspect this is because of the promise delivery between these two resolvers; I think we're accidentally expecting the datasets field and resolver to always be present, whereas the fields can be independent of each other.

i.e. although a data dependency from facets -> datasets may exist to compute the enabled flag, that doesn't imply the fields are always present in the query, and we're relying on the fields being queried to execute the resolvers.

Here's the hanging query:

{
  endpoint {
    catalog {
      id
      catalog_query {
        facets {
          publishers {
            id
            label
          }
          creators {
            id
            label
          }
          themes {
            id
            label
          }
        }
      }
    }
  }
}

Change schema: Make catalog default catalog-uri optional in schema

The schema currently requires that you provide a valid catalog URI, e.g.:

{
  endpoint {
    catalog(id: "http://gss-data.org.uk/catalog/datasets") {
      id
    }
  }
}

We should change this to be optional, defaulting to the hardcoded value above:

{
  endpoint {
    catalog {
      id
    }
  }
}

There's currently no need to make this configurable; though we might as well lift it into the .edn config as a hardcoded value, so it's closer to being configurable.

Associate a subdomain with datahost-ld service

Discuss and decide on a domain name strategy, and whether we need to wire up a proxy pass through nginx etc.

(We probably don't need to do this for the prototype, but we need to have, or start formulating, a plan for how the routes will map.)

Cache and expire SPARQL results

Some SPARQL queries are going to return only a modest volume of results and will rarely change. We should look at caching the results of these with core.cache or an alternative, and using information revealed by drafter to expire caches when new data is published.
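
A minimal sketch using core.cache's TTL cache; the TTL is a stand-in until we can expire entries from drafter's publication information:

(ns tpximpact.example.sparql-cache
  (:require [clojure.core.cache.wrapped :as cache]))

(def results-cache
  (cache/ttl-cache-factory {} :ttl (* 5 60 1000))) ; 5 minutes, illustrative

(defn cached-query
  "run-query is whatever function executes a SPARQL query string."
  [run-query query-string]
  (cache/lookup-or-miss results-cache query-string run-query))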

Integrate reitit duratom & series

This task is to integrate the three strands into one, such that the series work is exposed via OpenAPI 3 and reitit, and persists its data with duratom.

https://github.com/Swirrl/datahost-prototypes/blob/56f8c4508046b9f9abf19f1ac5dba33741de506c/datahost-ld-openapi/src/tpximpact/datahost/scratch/series.clj#L199C1-L229

Other things to do:

  1. Wire up duratom with the integrant config (sketched below)
  2. Coerce JSON appropriately (we want keys as strings)
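
A minimal sketch of item 1, wiring duratom into the integrant config; the key and :file-path are illustrative and the real values belong in the .edn system config:

(ns tpximpact.datahost.ldapi.db-sketch
  (:require [duratom.core :as duratom]
            [integrant.core :as ig]))

(defmethod ig/init-key ::store [_ {:keys [file-path]}]
  (duratom/duratom :local-file
                   :file-path file-path
                   :commit-mode :sync
                   :init {}))

;; in the system .edn, something like:
;; {:tpximpact.datahost.ldapi.db-sketch/store {:file-path "data/series.edn"}}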

Definition of done:

If you do a PUT /data/my-dataset-series of the appropriate content then you should be able to GET /data/my-dataset-series and get that document back out.

Inject dcterms:issued to all jsonld documents on first insert

Right now we only have series implemented, so this issue is just to inject dcterms:issued into those on create. The relevant model code is in datahost/ldapi/models/series.clj. Creates should be atomic, i.e. inject the timestamp before the db update is attempted, so either the whole JSON-LD document (with the issued timestamp) gets inserted or nothing does. This could be implemented in such a way that the timestamp-injection code can be reused for other models (we will also be adding the timestamp to e.g. releases and revisions once those are implemented).

This issue is done when all new series include a dcterms:issued key that reflects the time the series was created.
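
A rough sketch of the approach, with set-timestamp kept generic so releases and revisions can reuse it later; the function and db names are illustrative:

(defn set-timestamp [doc k]
  (assoc doc k (str (java.time.Instant/now))))

(defn insert-series
  "Injects dcterms:issued before the swap! so the document and timestamp
  are written together, or not at all."
  [db series-key doc]
  (swap! db assoc series-key (set-timestamp doc "dcterms:issued")))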

Add initial project dependencies we're going to use

Tech choices for prototype:

  • grafter
  • RDF4j native store (available via grafter.repository)
  • reitit (with OpenAPI); a complication is that we may need to deploy a build from the master branch to dev against this
  • flint for SPARQLing
  • Whatever the best JSON-LD option is (remind ourselves); we can probably get away with just using RDF4j's serialiser/deserialiser and a JSON parser though.

LD API: Mount a docker volume for native RDF store

Mount a docker volume for the LD API service in the production environment to store the database across application restarts.

Though perhaps we should be OK with it being transitory and lost on reboot during the prototype stage; we should tackle this only when we need persistence over restarts, and at that point we should use a real DB server, e.g. Jena/Stardog.

Review key properties for model and define their behaviour in our profile and add them

We've currently focussed on explicitly supporting a bare minimum of metadata, e.g. dct:title and dct:description, and not much else beyond the predicates that tie our model together.

We should review all the entities and the metadata terms that can be used on them...

e.g. for all entities we should probably add a managed dct:issued predicate (which cannot be changed once set), whilst dct:modified should be managed and set on every upsert.

Setup Continuous Deployment for the graphql-prototype

Idea is to start simple and get something going ASAP.

The current process is that we deploy docker containers for the datahost-graphql prototype to our public docker registry, and then start them on Google container-optimised images.

The idea is to automate this workflow such that our CI becomes CI/CD, where deployments happen on every merge to the main branch (or on a tag).

As the above prototype is currently stateless (or rather its state is supplied by the beta.gss-data.org.uk public SPARQL endpoint), we can kick the can down the road on some of the harder deployment questions.

In the short term deploys don't need to be downtimeless.

Some other ideas:

  1. It would be good to do this with terraform/IAC.
  2. If possible, use a container that contains the terraform environment; that way Circle can drive the container to do the deployment, and we can use it repeatably (and debug locally)
  3. We should do all of this in a new green field GCP project, to avoid any risk to other environments.

Once this is done, we can look at applying the same process to the datahost-ld-openapi prototype. At that point the process will likely also need some state deployed alongside it (so will require volumes mounted with data), and we'll need an initial provisioning phase followed by a more routine CI/CD-driven 'redeploy version' phase.

Deploy ld-openapi-prototype to GCP infrastructure

We haven't actually tried deploying this yet.

I expect it's 99% there, but suspect there may be a few small gotchas, e.g. I think the ring prototype binds to localhost and we need to change it to bind to 0.0.0.0 for it to work in the docker environment.

This ticket is basically to flush out any issues like this, by testing the deployment ASAP.
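
For the localhost gotcha specifically, a sketch of binding Jetty to all interfaces (handler and port are illustrative):

(ns tpximpact.example.server
  (:require [ring.adapter.jetty :as jetty]))

(defn handler [_req] {:status 200 :body "ok"})

(defn start []
  ;; bind to 0.0.0.0 so the service is reachable from outside the container
  (jetty/run-jetty handler {:host "0.0.0.0" :port 3000 :join? false}))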

Implement faceted search properly

There are some inconsistencies in the existing faceting behaviour.

We have captured what the new behaviour should be in failing tests in PR #31, and merged them pending this issue being resolved. They are flagged with ^:kaocha/pending metadata so they don't currently break the build, and they indicate functionality we've not yet implemented properly.

Some of the inconsistencies/issues with faceting are also listed below:

1. Enabled status on unconstrained query

This query:

{
  endpoint {
    catalog(id: "http://gss-data.org.uk/catalog/datasets") {
      id
      catalog_query {
        datasets {
          id
          publisher
        }
        facets {
          themes {
            id
            enabled
          }
          creators {
            id
            enabled
            label
          }
          publishers {
            id
            enabled
          }
        }
      }
    }
  }
}

This says all the facets have enabled = false, when I think they should all be true (if actually used by the data).

2. Locking a single facet value shouldn't disable other selections within that facet

See this query:

{
  endpoint {
    catalog(id: "http://gss-data.org.uk/catalog/datasets") {
      id
      catalog_query(themes: ["http://gss-data.org.uk/def/gdp#trade"]) {
        datasets {
          id
          publisher
        }
        facets {
          themes {
            id
            enabled
          }
          creators {
            id
            enabled
            label
          }
          publishers {
            id
            enabled
          }
        }
      }
    }
  }
}

I think in the results for this query many themes should be enabled, as most options will expand the set of query results. For example, adding the balanceofpayments theme to the list increases the results from ~7 to ~20.

It might be nice to disable options which don't grow the results, though at the minute I don't think that is necessary.

3. Locking a single facet value should enable appropriate selections in other facets

So for example in a variant of our original query:

{
  endpoint {
    catalog(id: "http://gss-data.org.uk/catalog/datasets") {
      id
      catalog_query(themes:["http://gss-data.org.uk/def/gdp#trade"]) {
        datasets {
          id
          label
        }
        facets {
          themes {
            id
          }
          publishers {
            id
            enabled
          }
        }
      }
    }
  }
}

the publishers returned shouldn't all have enabled = false; publishers who published data into the trade theme should implicitly be marked as enabled.

So the above query is inconsistent with the query which locks both theme = trade and creator = hmrc, as that query has results.

I think enabled is only set to true at the minute if you include that facet in your query, i.e. I think you're treating it as whether the checkbox has been ticked, whereas I'm thinking of it as whether you should enable/show or disable/hide that checkbox for the user.

POST | GET Changeset

This should probably follow a POST REDIRECT GET pattern.

# Appending a ChangeSet
#
# First PRG pattern to create a new ChangeSet:
#
# POST : /data/life-expectancy/changesets
# CONTROLLER: /data/:series/changesets
# Server responds:
#
# 303 /data/life-expectancy/release/2023/changesets/:auto-increment-changeset-id
#
# API call then creates this resource in the database:
</data/life-expectancy/changesets/1>
  a datahost:ChangeSet,
    datahost:InitialChangeSet ; # only on the first changeset in the sequence
  # datahost:previousChangeSet </data/life-expectancy/data/0> # in the case of subsequent changesets
  dcterms:issued "T4" .

POST | GET schemas to join to releases

We should review the scratch schema code and integrate it as rest API routes.

Ultimately we may want to support many schemas being associated with a single release. However in this iteration of the prototype we can assume just one.

Multiple schemas may not be necessary, but they may more cleanly support allowing people to incrementally add commitments, or layer other concerns (e.g. URI generation) onto existing datasets (providing all data within that release historically still conforms to the extra schema).


Implement POST / REDIRECT / GET pattern for creating schemas.

Route:

POST /data/:series-slug/release/:release-slug/schemas Accept: json+ld

BODY
{"dh:columns" [{"csvw:datatype" "string"
                "csvw:name" "foo_bar"
                "csvw:titles" ["Foo Bar"]}]}


Note that normalising the incoming request body should do the following (sketched after the list):

  • Add "@type" "dh:TableSchema" as a managed param to the table schema
  • Add "@type" "dh:DimensionColumn" to each column
  • Add "datahost:appliesToRelease" "</data/:series-slug/release/:release-slug>"
  • Add "appropriate-csvw:modeling-of-dialect" "UTF-8,RFC4180"

Server responds:
303 /data/:series-slug/release/:release-slug/schemas/:auto-incrementing-schema-id

Route:

GET /data/:series-slug/release/:release-slug/schemas Accept: json+ld

Note the release will also get the inverse triple:

</data/:series-slug/release/:release-slug> datahost:hasSchema </data/:series-slug/release/:release-slug/schemas/:schema-id>

Add draftset support (and revisit how endpoint will be changed)

Current schema looks like this:

{
  endpoint(draftset_id: "xyz") {
    endpoint_id
  }
}

However, switching the endpoint via a draftset_id parameter probably isn't the right way to do this: it won't play well with the GraphQL Global Object Identification pattern, which recommends having a get(id:String) field at the root of the schema, so for us draftset selection will need to be "over this".

Hence we should support a way of passing the draftset on the request.

Regardless I think endpoint metadata can stay in the schema, so this is really about moving where the draftset_id flag is passed.

Setup a simple RDF4j native store to act as SPARQL backend

This avoids us having to worry about provisioning an extra database service for the prototype; we can run it in process.

We should wire it up to programmatically load some background data into a single graph on startup. We might want to copy some functionality from the muttnik loader around hashing, so we can replace the background data if it differs from what is already there. Initially the background data doesn't need to contain much more than:

@base <http://www.example.org/> .

</graphs/background-data> {
  # Some background catalog metadata
  example:datasets
      a dcat:Catalog ;
      rdfs:label "Datasets" .
} 

Other things we need to do (at some point TBD):

Mount a docker volume for this service: #44

PUT | GET | DELETE Release route

Add the route /data/:series/release/:release-slug mirroring the patterns established in the series route.

This ticket can be DONE when users can upload a JSON-LD document of release metadata to the route and get it back out again.

Improvements to this area will be made in a future ticket, #76, where we add new representations of this resource for different content types after the changesets/commits routes/model are complete.

Sync triples from documents into triple store

  • remove duratom
  • (de)serialise json-ld as RDF on the fly, storing everything in only the triplestore

The documents are currently stored in duratom which maintains the same interface as a clojure atom.

All Clojure reference types, along with the duratom library we're using, support add-watch.

So an easy way to sync and serialise changes in the documents to RDF is by calling our ednld->rdf on the document and writing the triples to the native-store triplestore through a watcher added to the duratom.

If the triplestore write fails, we should retract the changes to the duratom too.
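
A rough sketch of that watcher; ednld->rdf is our existing conversion fn and write-triples! stands in for the native-store write, both passed in so the sketch is self-contained:

(defn watch-and-sync! [doc-store ednld->rdf write-triples!]
  (add-watch doc-store ::rdf-sync
             (fn [_key store old-doc new-doc]
               (try
                 (write-triples! (ednld->rdf new-doc))
                 (catch Exception _e
                   ;; retract the duratom change if the triplestore write fails
                   (reset! store old-doc))))))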

Is this safe?

Short answer yes.

Medium answer: yes, safe enough for now (eventually consistent)... though in a production system we would want additional consistency checks to ensure eventual consistency, and a method for handling partial failure. The duratom will witness changes before the triplestore, so we need to assume that; but as writes to the duratom and triplestore are synchronous within the initial blocking request, providing a user blocks on their requests, subsequent requests should appear consistent.

Long answer for people who spotted one potential gotcha:

The docstring for add-watch warns that the watcher may be called from multiple threads; additionally, watches on Clojure atoms don't guarantee ordering semantics, which would be bad as there'd be a risk of the two databases containing different states.

However, I have looked and it seems that duratom does protect against this by wrapping the underlying atom in a lock (and relying on the fact that atoms synchronously run their watches before returning from swap!).

I've also confirmed this behaviour at a REPL:

(def a (db/duratom :local-file :file-path "/tmp/test-duratom.edn" :commit-mode :sync :init 0 ))
(def b (atom []))
(add-watch a :watcher (fn [k ref old new] (swap! b conj [old new])))

(dotimes [n 100000] (future (swap! a inc)))

(apply < (map first @b)) ;; => true

The true here indicates that all starting (prior to inc) states were linearised.

However if you repeat the experiment with a normal atom you may witness the out of order behaviour mentioned:

(def a (atom 0))
(def b (atom []))
(add-watch a :watcher (fn [k ref old new] (swap! b conj [old new])))

(dotimes [n 100000] (future (swap! a inc)))
(apply < (map first @b)) ;; => false

Rename DatasetSeries

We should probably find a more specific name for dh:DatasetSeries, and rename it everywhere in the API.

The problem with DatasetSeries is that the dcat terminology is quite broad:

Dataset series are defined in [ISO-19115] as a "collection of datasets […] sharing common characteristics". However, their use is not limited to geospatial data, although in other domains they can be named differently (e.g., time series, data slices) and defined more or less strictly (see, e.g., the notion of "dataset slice" in [VOCAB-DATA-CUBE]).

Our meaning of a DatasetSeries is that it represents the name of a "dataset" which is stable over time, regardless of releases, revisions or schema and methodology changes. It is Heraclitus's river, through which change flows. Examples are datasets like "The Census" or "Indices of Multiple Deprivation", which have seen various schema and methodology changes over time and aren't necessarily directly comparable with previous releases. For us the series contains "releases", which in turn have stable schemas and contain "revisions". So the series allows for the representation of a series of "breaking changes" under the same "name", whilst still preserving all the notions of stability we care about in the releases/revisions below the series level.

However most users probably think of a series as a "time series", which for us would be a cube with a stable schema and a time dimension.

Hence I think it would be worth specialising dcat:DatasetSeries with a subclass with a narrower name; my favourite so far is:

dh:DatasetIdentity rdfs:subClassOf dcat:DatasetSeries .

Update graphql terraform to use a fixed ip so we can use DNS

Though we have done a large part of continuous deployment in #33, it is in a new GCP project, and the old prototype which has DNS set up is pointing to a fixed IP allocated in that project.

We can then do #91 and point that DNS to a new fixed IP which is in the new terraform deployment; at which point deploys will be fully automated for the prototype graphql service.

`enabled` field in facet results does not respect `search_string`

I've just noticed a bug in the implementation of facet results introduced in #63, in that it returns incorrect results in the presence of a search_string.

I suspect this is because it is calculating the facets' enabled status before filtering by the search string.

Example queries to show the issue:

This query locks the trade theme and searches within it for a string of "hmrc":

{
  endpoint {
    catalog {
      catalog_query(search_string:"hmrc"
         themes:["http://gss-data.org.uk/def/gdp#trade"]
      ) {
        datasets {
          title
          creator

        }
        facets {
          creators {
            enabled
            id
          }
          themes {
            enabled
            id
          }
        }
      }
    }
  }
}

The elided results below first show all datasets in the results have a creator of hmrc as expected:

          "datasets": [
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            },
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            },
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            },
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            },
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            }
          ],

However in the facets branch for creator we see:

...
              {
                "enabled": true,
                "id": "https://www.gov.uk/government/organisations/hm-revenue-customs"
              },
...
              {
                "enabled": true,
                "id": "https://www.gov.uk/government/organisations/department-for-digital-culture-media-sport"
              },
...

Which is wrong, because the only enabled facet should be https://www.gov.uk/government/organisations/hm-revenue-customs.

Removing the search string returns the expected two distinct creators:

            {
              "creator": "https://www.gov.uk/government/organisations/department-for-digital-culture-media-sport"
            },
            {
              "creator": "https://www.gov.uk/government/organisations/hm-revenue-customs"
            },
...

So it looks like we're just not applying the search string to the facets.

Inject dcterms:modified to all jsonld documents on put

Right now we only have series implemented, so this issue is just to inject dcterms:modified into those on update. The relevant model code is in datahost/ldapi/models/series.clj. Updates should be atomic, i.e. inject the timestamp before the db update is attempted, so either the whole JSON-LD document (with the modified timestamp) gets updated or nothing does. This could be implemented in such a way that the timestamp-injection code can be reused for other models (we will also be adding the timestamp to e.g. releases and revisions once those are implemented).

This issue is done when all updated series include a dcterms:modified key that reflects the time the series was updated.
