dclib's People

Contributors

dependabot[bot], der, mika018, rosecky, skwlilac


Forkers

terrypan swirrl

dclib's Issues

Provide a way to make URI/resource nodes from `ValueArray` strings with prefix expansion

In trying to create an array of URIs for some 'static' notes I tried the following:

r_auth_notes : "{value('sf-auth-note:').append( value([ 1,2,3,4 ]) ).asRDFNode() }"

Without the trailing asRDFNode() one is left with a string and, for some reason, elsewhere template expansion of

<core:note> : "<{r_auth_notes}>"

fails to see sf-auth-note: as a URI prefix rather than a URI scheme. .asRDFNode() will apply prefix expansion but has not been written (in ValueBase) to handle arrays.

[PR with suggested changes to follow]
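The element-wise behaviour being asked for might be sketched as follows. This is a minimal illustration, not dclib's actual ValueBase/ValueArray code: the class and method names here are hypothetical, standing in for asRDFNode() applying prefix expansion to each element of an array in turn.

```java
import java.util.Map;

// Hypothetical sketch: element-wise prefix expansion over an array of
// lexical values, as ValueArray.asRDFNode() might delegate per element.
public class PrefixExpandSketch {
    // Expand a single prefixed name against a prefix map; leave other strings alone.
    public static String expand(String lexical, Map<String, String> prefixes) {
        int colon = lexical.indexOf(':');
        if (colon > 0) {
            String ns = prefixes.get(lexical.substring(0, colon));
            if (ns != null) return ns + lexical.substring(colon + 1);
        }
        return lexical; // not a declared prefix: pass through unchanged
    }

    // The array case: apply the scalar expansion to every element.
    public static String[] expandAll(String[] values, Map<String, String> prefixes) {
        String[] result = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = expand(values[i], prefixes);
        }
        return result;
    }
}
```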

Multiple-values

Resource templates (and thus one-offs) need a simpler way to do multiple values. The manual use of split is not appropriate for cases like multiple fixed-type values.

Conditional templates

Need richer conditional checks than just "required" column values, e.g. apply this template if this value is such-and-such.

Non-json surface syntax

JSON is a good way to exchange the template definitions, but the inability to have multi-line strings and the lack of comments make it an awkward syntax in which to develop templates.

Could create a simple DSL which maps directly to the JSON but side-steps those limitations.

toCleanSegment() and full-stops

I have some source values like "v.high" and "v.low". If I call toCleanSegment() on those, the full-stop character is not converted. I'm not sure if that's by-design or accidental, but it would be nice if there was at least an option to generate "v-high" as a segment value.
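The requested option might look something like the sketch below. This is illustrative only (the `convertDots` flag and the exact cleaning rules are assumptions, not dclib's actual toCleanSegment() implementation):

```java
// Hypothetical variant of toCleanSegment() with an option to fold
// full stops into hyphens; the real dclib rules may differ.
public class CleanSegmentSketch {
    public static String toCleanSegment(String value, boolean convertDots) {
        String s = value.trim().toLowerCase();
        if (convertDots) s = s.replace('.', '-');
        // Fold any remaining unsafe characters to hyphens (assumed rule)
        s = s.replaceAll("[^a-z0-9.\\-]+", "-");
        return s;
    }
}
```

With `convertDots` enabled, "v.high" would come out as "v-high"; with it disabled, the current behaviour of leaving the full stop in place is preserved.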

Template Language documentation missing or moved.

The documentation page for the template language (detailing functions etc.) seems to have disappeared :-( Not sure how to find the old version in the git wiki either, but I believe it must be preserved somewhere.

What I was really looking for was the variable bindings passed into a dclib template from its execution environment.

ValueArray.map(...) behaviour when individual items fail to map.

I have a situation where I'm trying to map from several fields to a media-type (see attached template, map data and sample data).

- r_mt     : "{ { var res = nullValue() ;
                  /* 'empty(res) || res.isNull()' because plain 'null' and 'ValueNull' results arise - confusing*/
                  res = (empty(res) || res.isNull()) && !empty(resource_mime_type)      ?      resource_mime_type.toLowerCase().split(',').trim().map('mediatype.map',false) : res ;
                  res = (empty(res) || res.isNull()) && !empty(resource_mimetype_inner) ? resource_mimetype_inner.toLowerCase().split(',').trim().map('mediatype.map',false) : res ;
                  res = (empty(res) || res.isNull()) && !empty(resource_format)         ?         resource_format.toLowerCase().split(',').trim().map('mediatype.map',false) : res;    
                  return res;
              } }"

Most of the entries have single values in the CSV data file. However in the first two (non-header) rows there are entries that contain comma separated values within the cell, so the transform uses split(',') to form a ValueArray as the 'input' side of the map. In the first row, not all of the values within the array map successfully (the check_mt binding reports on failed or partially failed mappings).

When ValueBase.map fails to resolve a mapping it throws a new NullResult, which appears to abort further evaluation of the corresponding Pattern. In the case of ValueArray.map this means that potentially successful mappings are lost. Hence, for example, triples are generated for row 2 of the attached CSV, but not for row 1, even though the first two elements do resolve in the map.

I tried an experiment that changed the applyFunction on ValueArray to catch the NullResult, substitute a ValueNull for the absent result, and continue. [Aside: I did also try using a HashSet to collect results, but that disturbed the ordering of results to the extent of generating test failures.]

    public ValueArray applyFunction(MapValue map) {
        Value[] result = new Value[ value.length ];
        for (int i = 0; i < value.length; i++) {
            try {
                result[i] = map.map( value[i] );
            } catch (NullResult n) {
                // Substitute a null placeholder and continue with the rest of the array
                result[i] = new ValueNull();
            }
        }
        return new ValueArray(result);
    }

However, the presence of ValueNull in the binding result causes a subsequent failure when TemplateBase.asTriple(...) validates the object node and finds null. This throws an EvalFailed exception with a message of "Illegal or null RDF node result from pattern", which aborts processing for any further non-null array members. In the case of a 'simple' scalar null, the absent object value is silently skipped.

My larger concern is how to deal with the partial failure of ValueArray.map(...) when the mapping of a single element in the array fails. It seems right that there should be index alignment between input and output arrays. The experiment with HashSets lost that, and allowed the output array to be smaller than the input array. Index alignment could be important when multiple related arrays are being processed.

I think the map processing should insert ValueNulls into the array (or a default value, if given, as you suggested); and the later TemplateBase processing of a ValueArray with nulls in it should probably silently ignore the null and move on to the next value in the array.
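The proposed skip-the-null behaviour on the emitting side could look like this. The class and method names are illustrative, not dclib's actual TemplateBase code; it only shows the policy of skipping failed mappings instead of aborting the row:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed behaviour: when emitting triple objects from an
// array-valued binding, silently skip the null placeholders left behind by
// failed map() lookups, so successful elements in the same array survive.
public class NullSkippingEmit {
    public static List<String> emitObjects(String[] mapped) {
        List<String> objects = new ArrayList<>();
        for (String v : mapped) {
            if (v == null) continue; // failed mapping: skip, don't abort the row
            objects.add(v);
        }
        return objects;
    }
}
```

Index alignment is preserved in the input array (the nulls stay in place); only the emitted output drops them.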

Template

resource-media-type.yaml.txt

Media-type map source

media-type.ttl.txt

Data sample

resource_media_types.csv

Validation

In the case where a template does not apply, we need to be able to raise a warning message.

Use case is the profile features template, where lines with no feature_id should be of type SamplePoint (and will then be ignored).

Revised reporting structure

Clarify when errors should be silently ignored, when they should warn, and when they should be fatal.

Possibly reporting should be linked to the source template, with only the first occurrence of each error type for each template reported live, then an error count summarised after completion?
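That first-occurrence-then-summarise policy might be sketched like this (illustrative names, not dclib's reporting code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the proposed policy: log the first occurrence of each
// (template, error type) pair live, count repeats, summarise at the end.
public class ErrorSummary {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Returns true only on the first occurrence, i.e. when it should be logged live.
    public boolean record(String template, String errorType) {
        String key = template + "/" + errorType;
        return counts.merge(key, 1, Integer::sum) == 1;
    }

    // Per-key counts for the end-of-run summary.
    public Map<String, Integer> summary() { return counts; }
}
```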

Clarify error handling

Some errors, such as missing data in an RDF map, are silent and simply don't emit that triple. Some need to be fatal. Some need to be non-fatal but give better reporting to enable diagnosis of template problems. The whole area needs clarifying.

Replace asNode machinery

The asNode() function returns a different sort of wrapped object; having this alongside the dclib Value objects is confusing. Scrap this and fold the useful parts of the RDFNodeWrapper interface directly into Value(Node).

Is it possible to get segments, but camel-cased?

I'm synthesising resource URIs by using toCleanSegment() on a field value. This gives me hyphenated local names for resources (e.g. condition:condition-good). I'm finding that it would be easier for the code consuming the dclib output if the local names were camel-cased (so condition:conditionGood).

Is there a way to generate clean segments, but using camel-casing instead of kebabs?
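A camel-casing variant could be sketched as below. The name `toCamelSegment` is hypothetical (no such function is known to exist in dclib); it just shows the requested behaviour of camel-casing word boundaries instead of hyphenating them:

```java
// Hypothetical toCamelSegment(): like toCleanSegment(), but separators
// upper-case the following letter instead of becoming hyphens.
public class CamelSegmentSketch {
    public static String toCamelSegment(String value) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : value.trim().toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                out.append(upperNext ? Character.toUpperCase(c) : Character.toLowerCase(c));
                upperNext = false;
            } else {
                // Separator: camel-case the next letter (unless we're at the start)
                upperNext = out.length() > 0;
            }
        }
        return out.toString();
    }
}
```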

hash namespace prefix expansion with empty localname loses the trailing '#'

With appropriate prefix definitions:

@prefix geo:      <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix org:      <http://www.w3.org/ns/org#> .

the following line in a DCLIB template

"<dct:vocabulary>" : [ "<geo:>", "<skos:>", "<org:>" ],

results in the following URI (i.e. no '#') in the object position of the corresponding RDF statements:

<http://www.w3.org/2003/01/geo/wgs84_pos> .
<http://www.w3.org/2004/02/skos/core> .
<http://www.w3.org/ns/org> .
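The expected behaviour can be stated as a small sketch (illustrative code, not dclib's expansion machinery): expanding a prefixed name with an empty local part should return the namespace URI verbatim, trailing '#' included.

```java
import java.util.Map;

// Sketch of the expected behaviour: "geo:" with an empty local name
// expands to the full namespace URI, with no trimming of the '#'.
public class HashPrefixSketch {
    public static String expand(String curie, Map<String, String> prefixes) {
        int colon = curie.indexOf(':');
        String ns = prefixes.get(curie.substring(0, colon));
        // Simple concatenation, no normalisation: the trailing '#' survives.
        return ns + curie.substring(colon + 1);
    }
}
```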

dclib x.fetch() seems to only ask for .ttl

DCLIB's x.fetch function only seems to ask for .ttl responses. Some RDF sites are capable of supplying RDF/XML but don't offer Turtle, e.g. the Companies House linked data:

`http://business.data.gov.uk/id/company/01649776`

The following DCLIB warning arises when a .fetch() operation is performed on that URI

12:40:39.253 Warning: exception fetching http://business.data.gov.uk/id/company/01649776, org.apache.jena.atlas.web.HttpException: 406 - Unsupported content type: ttl

It would be useful if DCLIB made requests that also accept RDF/XML responses (maybe it does already, in which case it may be an issue with the way this particular site responds).
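One way to express the request is a broader HTTP Accept header that prefers Turtle but falls back to RDF/XML. The media types and q-values below are standard HTTP/RDF conventions; how fetch() would wire this in is an assumption, not dclib's actual code:

```java
// Sketch: an Accept header that prefers Turtle but also admits RDF/XML,
// so servers like the one above can still respond with something parseable.
public class AcceptHeaderSketch {
    public static String acceptHeader() {
        return "text/turtle, application/rdf+xml;q=0.9, */*;q=0.1";
    }
}
```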

Inline prefix declaration

The set of available URI prefixes is currently defined out of band (e.g. defaultPrefixes.ttl in the command line tool). It should be possible to define them in-line in a (Composite) template.

Other scripting languages

jexl is reasonably expressive but is not a full general-purpose language.

Support escaping to ruby or javascript?

Separate jar and standalone artifacts

Need to find a way to build a one-jar assembly separate from the library jar, so we can continue to support the command line version while reusing the code as a library.

Trying to generate tel: URI generates an error

vcard:tel (or vcard:hasTelephone) is an owl:ObjectProperty that is expected to have an object value, which is usually a tel: URI.

However a template line of:

        "<vcard:tel>"                  : "<{value('tel:').append(phone_number.asString())}>"

or more simply:

        "<vcard:tel>"                  : "<tel:{phone_number}>"

both result in errors of the form:

16:02:12.597 Warning: template contact applied but failed: com.epimorphics.dclib.framework.EvalFailed: Looks like unexpanded prefix in URI: tel:01223456789 [766]
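The fix the warning suggests is needed could be as simple as checking whether the part before the colon is actually a declared prefix before flagging it. This is a sketch of that check, not dclib's validation code:

```java
import java.util.Set;

// Sketch: only treat "xyz:rest" as an unexpanded prefix if "xyz" is a
// declared prefix; otherwise assume it is a URI scheme such as tel: or urn:.
public class SchemeVsPrefixSketch {
    public static boolean looksLikeUnexpandedPrefix(String uri, Set<String> declaredPrefixes) {
        int colon = uri.indexOf(':');
        if (colon <= 0) return false;
        return declaredPrefixes.contains(uri.substring(0, colon));
    }
}
```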

Coordinate Conversion

It would be useful to be able to easily generate coordinate conversions of the form:

xval.coordTransformX([fromCRS], [toCRS], [fromX], [fromY])
yval.coordTransformY([fromCRS], [toCRS], [fromX], [fromY])

e.g. from OSGB to WGS84:

northing.coordTransformY('epsg:27700', 'epsg:4326', northing, easting)
easting.coordTransformX('epsg:27700', 'epsg:4326', northing, easting)

Java libraries like GeoTools contain the means to build coordinate transforms between applicable CRSs, i.e. CRSs that are relevant in the same geography.

source type constraints

Do map source type constraints apply to the entity that is resolved (whose property or identity is returned), or to the returned property?

Just noting the question because so far I've had no success in using "type" to create a more selective mapping - I've only managed to make them over-selective, i.e. no results. Quite likely user error.

Check prefixes

Currently unexpanded prefixes will be silently emitted into the data as bad non-HTTP URIs. We need automatic warnings for this case.

Array valued 'bind'

DCLIB doesn't currently support array-valued 'bind' template variable bindings; however, in value templates it accepts array-valued expressions, transforming them into multi-valued properties.

It would be useful, and more uniform, to be able to create array-valued variable bindings within a template, e.g. to provide a common set of vocabulary URIs for VoID metadata (in this case preprocessed to remove trailing '#' characters from namespace URIs).

"bind" : [ { "$ns" : "{= asResource($$).replaceAll('#$','') }"
           },{
             "$vocabularies" : [ "<{$ns.apply('dgu:')}>",
                                 "<{$ns.apply('def-bwq:')}>",
                                 "<{$ns.apply('skos:')}>",
                                 "<{$ns.apply('def-bw:')}>",
                                 "<{$ns.apply('def-sp:')}>",
                                 "<{$ns.apply('def-stp:')}>",
                                 "<{$ns.apply('qb:')}>" ]
           } ]

Syntax checking and IDE help

While following @skwlilac writing and updating templates, it struck me that there are quite a few ways to get things wrong and that we could look at tools to help us write well-formed templates.

Things like the Language Server Protocol along with YAML / JSON schemas, the "Anything Markup Language", SchemaStore and lots of others can be used to help authors and highlight errors.

Add support to create ValueArrays with the global value function

Within JEXL it is possible to create both collections (HashMaps and HashSets) and arrays, which are useful when micro-parsing table cell values that contain some form of repetitive structure. It is desirable to pass an array of the resulting values back to a template as a ValueArray, as a means of passing a multi-valued result.

  var res = { 'one':'one' }   // Create a Map
  res.clear() ;               // Empty the Map
  ...                         // Work to fill the map - assume that map keys are ValueNodes representing RDF nodes.
  return value(res.keySet())

PR proposing a fix to follow.

.map() type-based selection: disjunctive selection on multiple types

It would be useful to be able to define sources (for .map()) where any "type" constraint can take a disjunctive set of values. For example, in mapping district names into admin geography URIs it would be useful to be able to say something like:

type : [ "<admingeo:UnitaryAuthority>", "<admingeo:District>",
         "<admingeo:MetropolitanBorough>", "<admingeo:LondonBorough>",
         "<admingeo:CivilAdministrativeArea>" ]

Currently an unconstrained type search leads to some unexpected results, while a constraint on a single type is over-constraining. One could work around the latter with a cascade of conditional assignments in JEXL that probe for a result and use the first non-null result obtained - but that is very tedious.
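The desired matching rule is simple set membership: a candidate qualifies if any of its types is in the allowed set. A sketch (illustrative classes, not dclib's map-source code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of disjunctive type matching: a candidate passes the "type"
// constraint if at least one of its types is in the allowed set.
public class DisjunctiveTypeFilter {
    public static List<String> select(Map<String, Set<String>> candidateTypes, Set<String> allowed) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : candidateTypes.entrySet()) {
            for (String type : e.getValue()) {
                if (allowed.contains(type)) {
                    hits.add(e.getKey());
                    break; // one matching type is enough
                }
            }
        }
        return hits;
    }
}
```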

Carriage Return 0x0d rewritten in cell values

We've had a few source CSV files with the character sequence CR CR LF inside cells, used to denote a single line break. The resulting sequence in dclib (i.e. when dealing with values in templates) is LF LF, essentially doubling the line breaks. This becomes a problem when users are trying to separate paragraphs in a description, for instance, where the usual practice is to use a double line break to separate paragraphs; this comes out as 4 LF characters and often gets turned into two <br />s in the resulting HTML.

@skwlilac and I tracked this down to the CSV parser opencsv (version 2.3) which underneath uses Java's BufferedReader to iterate over lines of text, where lines are delimited by either CR, LF, or CR LF. As far as the CSV parser is concerned, if a line ends in the middle of a quoted cell, then it adds back a LF and continues reading the value.

This dependency comes from lib version 2.0.0, which in turn comes from appbase 2.0.0.

opencsv looks to have had a number of forks and owners over the intervening years, but it does now say that it deals properly with CR characters in values.

While this character sequence shouldn't be used in the first place, I'd argue that the parser shouldn't be interpreting characters in cell values and should pass things through verbatim for processing within templates.
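The doubling is easy to reproduce with plain BufferedReader, which is what opencsv 2.3 uses underneath. BufferedReader.readLine treats CR, LF and CR LF each as a line break, so CR CR LF splits into two lines; a parser that re-joins continued quoted-cell lines with a single LF then emits LF LF for what was one break. A minimal demonstration (the re-joining method here stands in for the CSV layer's behaviour, it is not opencsv's code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Demonstrates the LF doubling: CR CR LF inside a cell becomes two
// readLine() breaks, so re-joining with LF yields LF LF.
public class CrCrLfDemo {
    public static String rejoinLines(String cell) {
        try (BufferedReader r = new BufferedReader(new StringReader(cell))) {
            StringBuilder out = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) {
                if (out.length() > 0) out.append('\n'); // what the CSV layer adds back
                out.append(line);
            }
            return out.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```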

.Map() based reconciliation: would like to retrieve multiple related properties

When reconciling data using .map() it would be useful to copy more than one property or attribute from the reconciled data into the generated output. This is particularly the case for RDF data where, in addition to reconciling say a local authority name against an admin geography to obtain a reference URI for the local authority (or the district it administers), it is generally useful to include rdf:type, rdfs:label and/or skos:prefLabel values for that entity in the generated output.

I've worked around this by creating multiple maps onto the same data that return different values for different properties. However, that risks some uncertainty as to whether the returned property values all relate to the same entity, and it also has the (current) limitation that only one value for the given property can be obtained.

One-off to sniff first line?

It often seems to be the case that we have to generate "one-off" dataset descriptions which depend on some uniform columns in the CSV (e.g. "parent department" for organograms), or to generate an example entry.

We could execute one-off templates in an environment where the column variables are bound to a preview set of values taken from the first non-heading line of the CSV.
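Building that preview environment is just zipping the header row against the first data row. A sketch of the idea (illustrative names, not dclib's binding machinery):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: bind column variables for a one-off template from the header
// row plus the first non-heading data row of the CSV.
public class PreviewBindings {
    public static Map<String, String> fromFirstRow(String[] header, String[] firstRow) {
        Map<String, String> env = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < firstRow.length; i++) {
            env.put(header[i], firstRow[i]);
        }
        return env;
    }
}
```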

JSON Library?

The current manual processing of JSON using the generic JSON representation is a bit primitive.

Consider switching to Jackson data binding to simplify and clarify the code.

OTOH the current parsing works and allows a lot more flexibility in the template design than a data-binding system would.

Maybe switching to Jackson but mostly using the generic DOM-like interface would be best. That would allow use of data binding where appropriate in extensions without requiring a redesign of stuff that already works.

Extraneous prefixes emitted with no (obvious) way to remove them

Taking a minimal template:

name: prefixes-test
type: Composite

and minimal source data:

t1,t2
1,2

The output from dclib prefixes-test.yaml source.csv is:

@prefix void:  <http://rdfs.org/ns/void#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix org:   <http://www.w3.org/ns/org#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix version: <http://purl.org/linked-data/version#> .
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix at:    <http://environment.data.gov.uk/public-register/def/applicant-type/> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix reg:   <http://purl.org/linked-data/registry#> .
@prefix time:  <http://www.w3.org/2006/time#> .
@prefix api:   <http://purl.org/linked-data/api/vocab#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

Some of these are fine in general use, but some are less commonly needed (e.g. reg, qb) and some are not really needed at all except in very specific circumstances (e.g. at).

As far as I can tell, setting the prefixes option in the template file adds to the set of known prefixes, but does not remove any of the default set.

It would be nice if there were a mechanism (e.g. a command line option or a config setting) to suppress the built-in default prefixes and only use those specified in the template.

I'm currently using DCLIB 2.1.2-SNAPSHOT.

Improve progress indicator

Measure the data size at the start and give % progress on monitor updates.

Measuring the data size could be a brute-force read-through with line counting, or could be a file-size / line-size estimate based on probing the size of the first few lines?
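The cheaper of the two options amounts to dividing total file size by the average byte length of the first few lines. A sketch (assumed approach, not existing dclib code):

```java
// Sketch of the cheap estimate: probe the byte lengths of the first few
// lines, average them, and divide the total file size by that average
// to approximate the total line count for a % progress figure.
public class ProgressEstimate {
    public static long estimateLines(long fileSizeBytes, long[] probedLineLengths) {
        long total = 0;
        for (long len : probedLineLengths) total += len;
        long avg = Math.max(1, total / probedLineLengths.length); // guard against zero
        return fileSizeBytes / avg;
    }
}
```

The estimate is rough when line lengths vary a lot, but it only drives a progress indicator, so being off by a few percent is harmless.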
