Home Page: https://sparql-anything.cc/

License: Apache License 2.0

SPARQL Anything

SPARQL Anything is a system for Semantic Web re-engineering that allows users to ... query anything with SPARQL.

Main features:

Provides a homogeneous view over heterogeneous data sources, thanks to the Facade-X meta-model (see the Facade-X specification)
Query files in plain SPARQL 1.1, via the SERVICE <x-sparql-anything:> (see configuration) and build knowledge graphs with CONSTRUCT queries
Supported formats: XML, JSON, CSV, HTML, Excel, Text, Binary, EXIF, File System, Zip/Tar, Markdown, YAML, BibTeX, DOCX, PPTX (see the pages dedicated to the single formats)
Transforms files, inline content, or the output of an external command
Generates RDF, RDF-Star, and tabular data (thanks to SPARQL)
Full-fledged HTTP client to query Web APIs (headers, authentication, all methods supported)
Functions library for RDF sequences, strings, hashes, easy entity building, ...
Combine multiple SERVICE clauses into complex data integration queries (thanks to SPARQL)
Query templates (using BASIL variables)
Save and reuse SPARQL Result Sets as input for parametric queries
Slice large CSV, JSON and XML files with an iterator-like execution style (see #202 and #203)
Supports an on-disk option (with Apache Jena TDB2)

Quickstart

SPARQL Anything uses a single generic abstraction for all data source formats called Facade-X.

Facade-X

Facade-X is a simple meta-model used by SPARQL Anything transformers to generate RDF data from diverse data sources. Intuitively, Facade-X uses a subset of RDF to represent the source content as it is, but in RDF. The model combines two types of elements: containers and literals. A Facade-X data object always has a single root container. Container members are key-value pairs, where keys are either RDF properties or container membership properties, and values are either RDF literals or other containers. This is a generic example of a Facade-X data object (more examples below):

@prefix fx: <http://sparql.xyz/facade-x/ns/> .
@prefix xyz: <http://sparql.xyz/facade-x/data/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[] a fx:root ; rdf:_1 [
    xyz:someKey "some value" ;
    rdf:_1 "another value with unspecified key" ;
    rdf:_2 [
        rdf:type xyz:MyType ;
        rdf:_1 "another value"
    ]
] .

More details on the Facade-X metamodel can be found here.
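
To make the container/literal structure concrete, here is a minimal Python sketch (not the actual SPARQL Anything triplifier) of how a JSON-like value could be laid out following the description above: object keys become properties in the `xyz:` namespace, array items become container membership properties (`rdf:_1`, `rdf:_2`, ...), and nested objects or arrays become new containers. The function name `facade_x_triples` and the `_:cN` blank node labels are illustrative assumptions.

```python
import itertools

XYZ = "http://sparql.xyz/facade-x/data/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FX = "http://sparql.xyz/facade-x/ns/"


def facade_x_triples(value):
    """Return (root, triples) for a JSON-like dict or list.

    Simplified illustration only: containers are labelled "_:c1", "_:c2", ...
    and literals are kept as plain Python values.
    """
    counter = itertools.count(1)
    triples = []

    def new_container():
        return f"_:c{next(counter)}"

    def fill(container, val):
        if isinstance(val, dict):
            # Object keys become properties in the xyz: namespace.
            items = [(XYZ + key, v) for key, v in val.items()]
        elif isinstance(val, list):
            # Array positions become container membership properties.
            items = [(f"{RDF}_{i}", v) for i, v in enumerate(val, start=1)]
        else:
            raise TypeError("a container must be built from a dict or a list")
        for prop, v in items:
            if isinstance(v, (dict, list)):
                child = new_container()
                triples.append((container, prop, child))
                fill(child, v)
            else:
                triples.append((container, prop, v))

    root = new_container()
    triples.append((root, RDF + "type", FX + "root"))  # the single root container
    fill(root, value)
    return root, triples
```

Calling `facade_x_triples([{"name": "Friends"}])` yields a root container typed `fx:root` whose `rdf:_1` slot points to a second container carrying the `xyz:name` value.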

Querying anything

SPARQL Anything extends the Apache Jena ARQ processors by overloading the SERVICE operator, as in the following example:

Suppose you have the following JSON file as input (also available at https://sparql-anything.cc/example1.json):

[
  {
    "name": "Friends",
    "genres": [
      "Comedy",
      "Romance"
    ],
    "language": "English",
    "status": "Ended",
    "premiered": "1994-09-22",
    "summary": "Follows the personal and professional lives of six twenty to thirty-something-year-old friends living in Manhattan.",
    "stars": [
      "Jennifer Aniston",
      "Courteney Cox",
      "Lisa Kudrow",
      "Matt LeBlanc",
      "Matthew Perry",
      "David Schwimmer"
    ]
  },
  {
    "name": "Cougar Town",
    "genres": [
      "Comedy",
      "Romance"
    ],
    "language": "English",
    "status": "Ended",
    "premiered": "2009-09-23",
    "summary": "Jules is a recently divorced mother who has to face the unkind realities of dating in a world obsessed with beauty and youth. As she becomes older, she starts discovering herself.",
    "stars": [
      "Courteney Cox",
      "David Arquette",
      "Bill Lawrence",
      "Linda Videtti Figueiredo",
      "Blake McCormick"
    ]
  }
]

With SPARQL Anything you can select the TV series starring "Courteney Cox" with the following SPARQL query:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {

    SERVICE <x-sparql-anything:https://sparql-anything.cc/example1.json> {
        ?tvSeries xyz:name ?seriesName .
        ?tvSeries xyz:stars ?star .
        ?star fx:anySlot "Courteney Cox" .
    }

}

and get this result without having to transform the JSON to RDF first:

seriesName
"Cougar Town"
"Friends"

Using the Command Line Interface

SPARQL Anything requires Java >= 11 to be installed on your system. Download the latest version of the SPARQL Anything command line from the releases page. The command line is a file named sparql-anything-<version>.jar. Save the query above in a file named, for example, query.sparql. The query can then be executed as follows:

java -jar sparql-anything-<version>.jar -q query.sparql

See the usage section for details on the command line interface.

Using the server

SPARQL Anything is also released as a server, embedded into an instance of the Apache Jena Fuseki server. The server requires Java >= 11 to be installed on your system. Download the latest version of the SPARQL Anything server from the releases page. The server is a file named sparql-anything-server-<version>.jar.

Run the server as follows:

$ java -jar sparql-anything-server-<version>.jar 
[main] INFO io.github.sparqlanything.fuseki.Endpoint - sparql.anything endpoint
[main] INFO io.github.sparqlanything.fuseki.Endpoint - Starting sparql.anything endpoint..
[main] INFO io.github.sparqlanything.fuseki.Endpoint - The server will be listening on http://localhost:3000/sparql.anything
[main] INFO io.github.sparqlanything.fuseki.Endpoint - The server will be available on http://localhost:3000/sparql
[main] INFO org.eclipse.jetty.server.Server - jetty-10.0.6; built: 2021-06-29T15:28:56.259Z; git: 37e7731b4b142a882d73974ff3bec78d621bd674; jvm 11.0.10+9
[main] INFO org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@782a4fff{org.apache.jena.fuseki.Servlet,/,null,AVAILABLE}
[main] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@c7a975a{HTTP/1.1, (http/1.1)}{0.0.0.0:3000}
[main] INFO org.eclipse.jetty.server.Server - Started Server@35beb15e{STARTING}[10.0.6,sto=0] @889ms
[main] INFO org.apache.jena.fuseki.Server - Start Fuseki (http=3000)

Access the SPARQL UI at the address http://localhost:3000/sparql, where you can copy the query above and execute it. See the usage section for details on the SPARQL Anything Fuseki server.
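
Besides the UI, the endpoint can presumably be queried programmatically like any other SPARQL endpoint. The following Python sketch assumes the server is running locally on the default port and path shown in the log above, and that it accepts queries over HTTP GET as described by the SPARQL 1.1 Protocol. The helper names `build_query_request` and `run_select` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Default endpoint, taken from the server log above.
ENDPOINT = "http://localhost:3000/sparql.anything"


def build_query_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query and return its bindings (needs a running server)."""
    with urllib.request.urlopen(build_query_request(query, endpoint)) as response:
        results = json.load(response)
    return results["results"]["bindings"]
```

With the server started as above, `run_select` could be given the text of query.sparql to retrieve the same bindings that the UI displays.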

Supported Formats

Currently, SPARQL Anything supports the following list of formats but the possibilities are limitless! The data is interpreted as in the following examples (using default settings).

A detailed description of the interpretation can be found in the following pages:

... and, of course, the triples generated from these formats can be integrated with the content of any static RDF file.

Configuration

SPARQL Anything behaves as a standard SPARQL query engine. For example, the SPARQL Anything server will act as a virtual endpoint that can be queried exactly as a remote SPARQL endpoint. In addition, SPARQL Anything provides a rich Command Line Interface (CLI). For information on how to run SPARQL Anything, please see the quickstart and usage sections of the documentation.

Passing triplification options via SERVICE IRI

In order to instruct the query processor to delegate the execution to SPARQL Anything, you can use the following IRI-schema within SERVICE clauses. A minimal URI that uses only the resource locator is also possible. In this case SPARQL Anything guesses the data source type from the file extension.

Note: Use the file:// protocol to reference local files
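
As an illustration, a SERVICE IRI carrying options can be assembled as in the Python sketch below. The examples in this document show a single `option=value` pair after the scheme (`x-sparql-anything:blank-nodes=false`); joining several pairs with commas is an assumption of this sketch, as is the helper name `service_iri`. Escaping of special characters inside values is not handled here.

```python
SCHEME = "x-sparql-anything:"


def service_iri(options):
    """Join option=value pairs after the x-sparql-anything: scheme.

    Assumption: pairs are comma-separated. No escaping is performed.
    """
    pairs = [f"{name}={value}" for name, value in options.items()]
    return SCHEME + ",".join(pairs)


def minimal_service_iri(location):
    """The minimal form: only the resource locator after the scheme."""
    return SCHEME + location
```

For example, `minimal_service_iri("https://sparql-anything.cc/example1.json")` reproduces the IRI used in the quickstart query.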

Passing triplification options via Basic Graph Pattern

Alternatively, options can be provided as a basic graph pattern inside the SERVICE clause, as follows:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {

    SERVICE <x-sparql-anything:> {
        fx:properties fx:location "https://sparql-anything.cc/example1.json" .
        ?tvSeries xyz:name ?seriesName .
        ?tvSeries xyz:stars ?star .
        ?star fx:anySlot "Courteney Cox" .
    }

}

Note that

The SERVICE IRI scheme must be x-sparql-anything:.
Each triplification option to pass to the engine corresponds to a triple of the Basic Graph Pattern inside the SERVICE clause.
Such triples must have fx:properties as subject, fx:[OPTION-NAME] as predicate, and a literal or a variable as object.
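
When queries are generated by a program, the option triples can be produced from a dictionary. The Python sketch below follows the three rules above and emits one `fx:properties fx:[OPTION-NAME] "value" .` pattern per option. The helper name `options_as_bgp` is hypothetical, and only literal objects are covered (variables are not).

```python
def options_as_bgp(options):
    """Render triplification options as triple patterns for a SERVICE clause."""
    lines = []
    for name, value in options.items():
        # Escape backslashes and double quotes so the literal stays valid.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'fx:properties fx:{name} "{escaped}" .')
    return "\n".join(lines)
```

The output is meant to be pasted inside `SERVICE <x-sparql-anything:> { ... }`, next to the patterns that query the data, with the `fx:` prefix declared as in the query above.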

You can also mix the two modalities as follows.

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {

    SERVICE <x-sparql-anything:blank-nodes=false> {
        fx:properties fx:location "https://sparql-anything.cc/example1.json" .
        ?tvSeries xyz:name ?seriesName .
        ?tvSeries xyz:stars ?star .
        ?star fx:anySlot "Courteney Cox" .
    }

}

General purpose options

Option name	Description	Valid Values	Default Value
location*	The URL of the data source.	Any valid URL or (absolute or relative) path of the file system.	*
content*	The content to be transformed.	Any valid literal.	*
command*	An external command line to be executed. The output is handled according to the option 'media-type'	Any valid literal.	*
from-archive	The filename of the resource to be triplified within an archive.	Any filename.	No value
root	The IRI of generated root resource. The root will be used as a namespace for the graphs and containers that will be generated.	Any valid IRI.	location (in the case of location argument set) or 'http://sparql.xyz/facade-x/data/' + md5Hex(content) (in the case of content argument set) or 'http://sparql.xyz/facade-x/data/' + md5Hex(command) (in the case of command argument set)
media-type	The media-type of the data source.	Any valid Media-Type. Supported media types are specified in the pages dedicated to the supported formats	No value (the media-type will be guessed from the file extension)
namespace	The namespace prefix for the properties and classes that will be generated.	Any valid namespace prefix.	http://sparql.xyz/facade-x/data/
blank-nodes	It tells SPARQL Anything whether to generate blank nodes.	true/false	true
trim-strings	Trim all string literals.	true/false	false
null-string	Do not produce triples where the specified string would be in the object position of the triple.	Any string	No value
http.*	A set of options for customising the HTTP request method, headers, query string, and others. More details on the HTTP request configuration	-	No value
triplifier	It forces SPARQL Anything to use a specific triplifier for transforming the data source	A canonical name of a Java class	No value
charset	The charset of the data source.	Any charset.	UTF-8
metadata	It tells SPARQL Anything to extract metadata from the data source and to store it in the named graph with URI <http://sparql.xyz/facade-x/data/metadata> More details	true/false	false
ondisk	It tells SPARQL Anything to use an on disk graph (instead of the default in memory graph). The string should be a path to a directory where the on disk graph will be stored. Using an on disk graph is almost always slower (than using the default in memory graph) but with it you can triplify large files without running out of memory.	A path to a directory	No value
ondisk.reuse	When using an on disk graph, it tells SPARQL Anything to reuse the previous on disk graph.	true/false	true
strategy	The execution strategy. 0 = in memory, all triples; 1 = in memory, only triples matching any of the triple patterns in the where clause	0,1	1
slice	The resource is sliced and the SPARQL query is executed on each of the parts. Supported by: CSV (row by row); JSON (arrays are sliced item by item, JSON objects require `json.path`); XML (requires `xml.path`)	true/false	false
use-rdfs-member	It tells SPARQL Anything to use the (super)property rdfs:member instead of container membership properties (rdf:_1, rdf:_2 ...)	true/false	false
annotate-triples-with-slot-keys	It tells SPARQL Anything to annotate slot statements with slot keys (see issue #378)	true/false	false

* It is mandatory to provide either location, content, or command.
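
The default value of the `root` option described in the table can be sketched in Python as follows. The sketch assumes that `md5Hex` is the hexadecimal MD5 digest of the UTF-8 encoded string, and that `location` takes precedence when several source options are set; the function name `default_root` is hypothetical.

```python
import hashlib

NAMESPACE = "http://sparql.xyz/facade-x/data/"


def default_root(location=None, content=None, command=None):
    """Compute the default root IRI following the rule in the options table."""
    if location is not None:
        return location
    source = content if content is not None else command
    if source is None:
        raise ValueError("one of location, content, or command is required")
    # Assumption: md5Hex hashes the UTF-8 bytes of the string.
    return NAMESPACE + hashlib.md5(source.encode("utf-8")).hexdigest()
```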

More details on configuration

Query templates and variable bindings (CLI only)

The SPARQL Anything CLI supports parametrised queries. SPARQL Anything uses the BASIL convention for variable names in queries.

The syntax is based on the underscore character: '_', and can be easily learned by examples:

?_name The variable specifies the API mandatory parameter name. The value is incorporated in the query as plain literal.
?__name The parameter name is optional.
?_name_iri The variable is substituted with the parameter value as an IRI.
?_name_en The parameter value is considered as literal with the language 'en' (e.g., en,it,es, etc.).
?_name_integer The parameter value is considered as literal and the XSD datatype 'integer' is added during substitution.
?_name_prefix_datatype The parameter value is considered as literal and the datatype 'prefix:datatype' is added during substitution. The prefix must be specified according to the SPARQL syntax.
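
The naming convention can be read mechanically, as the Python sketch below illustrates. It is not the BASIL implementation: in particular, the rule used to tell a language tag (`_en`) from an XSD datatype (`_integer`), namely a small fixed set of XSD type names, is an assumption of this sketch, as are the function name `parse_basil_variable` and the shape of the returned dictionary.

```python
# Assumption: suffixes in this set are read as XSD datatypes;
# any other single suffix (except "iri") is read as a language tag.
XSD_TYPES = {"integer", "int", "string", "boolean", "double",
             "float", "decimal", "date", "dateTime"}


def parse_basil_variable(variable):
    """Split a variable such as ?_name_iri into name, optionality and kind."""
    if not variable.startswith("?_"):
        return None  # an ordinary SPARQL variable, not a parameter
    body = variable[2:]
    optional = body.startswith("_")  # ?__name marks an optional parameter
    if optional:
        body = body[1:]
    name, *suffix = body.split("_")
    if not suffix:
        kind = ("literal", None)
    elif suffix == ["iri"]:
        kind = ("iri", None)
    elif len(suffix) == 1 and suffix[0] in XSD_TYPES:
        kind = ("datatype", "xsd:" + suffix[0])
    elif len(suffix) == 1:
        kind = ("lang", suffix[0])
    else:
        # ?_name_prefix_datatype
        kind = ("datatype", suffix[0] + ":" + "_".join(suffix[1:]))
    return {"name": name, "optional": optional, "kind": kind}
```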

Variable bindings can be passed in two ways via the CLI argument -v|--values:

Inline arguments, e.g.: -v paramName=value1 -v paramName=value2 -v paramName2=other
Passing a SPARQL Result Set file, e.g.: -v selectResult.xml

In the first case, the engine computes the Cartesian product of all the variable bindings provided and executes the query once for each resulting set of bindings.

In the second case, the query is executed for each set of bindings in the result set.
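
The expansion of inline arguments into sets of bindings can be sketched in Python with `itertools.product`. The function name `binding_sets` is hypothetical and the sketch only illustrates the combinatorics, not the CLI's actual argument parsing.

```python
import itertools


def binding_sets(arguments):
    """Expand name=value arguments into every combination of bindings."""
    values = {}
    for argument in arguments:
        name, _, value = argument.partition("=")
        values.setdefault(name, []).append(value)
    names = list(values)
    combinations = itertools.product(*(values[name] for name in names))
    return [dict(zip(names, combination)) for combination in combinations]
```

With the arguments from the example above, `binding_sets(["paramName=value1", "paramName=value2", "paramName2=other"])` produces two sets of bindings, so the query would be executed twice.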

The following is an example of how a parameter can be used in a query:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {
    SERVICE <x-sparql-anything:https://sparql-anything.cc/example1.json> {
        ?tvSeries xyz:name ?seriesName .
        ?tvSeries xyz:stars ?star .
        ?star fx:anySlot ?_starName .
    }

}

The value of ?_starName can be passed via the CLI as follows:

java -jar sparql-anything-<version>.jar -q query.sparql -v starName="Courteney Cox"

Functions and magic properties

SPARQL Anything provides a number of functions and magic properties to help users query the sources and construct knowledge graphs.

NOTE: SPARQL Anything is built on Apache Jena, see a list of supported functions on the Apache Jena documentation.

Name	Function/Magic Property	Input	Output	Description
fx:anySlot	Magic Property	-	-	This property matches the RDF container membership properties (e.g. `rdf:_1`, `rdf:_2` ...).
fx:cardinal(?a)	Function	Container membership property	Integer	`fx:cardinal(?a)` returns the corresponding cardinal integer from `?a` (`rdf:_24` -> `24`)
fx:isContainerMembershipProperty(?p)	Function	Container membership property	Boolean	`fx:isContainerMembershipProperty(?p)` returns true if the node passed as parameter is a container membership property (`rdf:_24` -> `true`)
fx:before(?a, ?b)	Function	Container membership properties	Boolean	`fx:before(?a, ?b)` returns `true` if `?a` and `?b` are container membership properties and `?a` is lower than `?b`, `false` otherwise
fx:after(?a, ?b)	Function	Container membership properties	Boolean	`fx:after(?a, ?b)` returns `true` if `?a` and `?b` are container membership properties and `?a` is higher than `?b`, `false` otherwise
fx:previous(?a)	Function	Container membership property	Container membership property	`fx:previous(?a)` returns the container membership property that precedes `?a` (`rdf:_2` -> `rdf:_1`)
fx:next(?b)	Function	Container membership property	Container membership property	`fx:next(?b)` returns the container membership property that succeeds `?b` (`rdf:_1` -> `rdf:_2`)
fx:forward(?a, ?b)	Function	Container membership property, Integer	Container membership property	`fx:forward(?a, ?b)` returns the container membership property that follows `?a` by `?b` steps (`rdf:_2, 5` -> `rdf:_7`)
fx:backward(?a, ?b)	Function	Container membership property, Integer	Container membership property	`fx:backward(?a, ?b)` returns the container membership property that precedes `?a` by `?b` steps (`rdf:_24, 4` -> `rdf:_20`)
fx:String.startsWith(?stringA, ?stringB)	Function	String, String	Boolean	`fx:String.startsWith` wraps `java.lang.String.startsWith`
fx:String.endsWith(?stringA, ?stringB)	Function	String, String	Boolean	`fx:String.endsWith` wraps `java.lang.String.endsWith`
fx:String.indexOf(?stringA, ?stringB)	Function	String, String	Integer	`fx:String.indexOf` wraps `java.lang.String.indexOf`
fx:String.substring(?string)	Function	String, Integer, (Integer?)	String	`fx:String.substring` wraps `java.lang.String.substring`
fx:String.toLowerCase(?string)	Function	String	String	`fx:String.toLowerCase` wraps `java.lang.String.toLowerCase`
fx:String.toUpperCase(?string)	Function	String	String	`fx:String.toUpperCase` wraps `java.lang.String.toUpperCase`
fx:String.replace(?string, ?characterA, ?characterB)	Function	String, Character, Character	String	`fx:String.replace` wraps `java.lang.String.replace`
fx:String.trim(?string)	Function	String	String	`fx:String.trim` wraps `java.lang.String.trim`
fx:String.stripLeading(?string)	Function	String	String	`fx:String.stripLeading` wraps `java.lang.String.stripLeading`
fx:String.stripTrailing(?string)	Function	String	String	`fx:String.stripTrailing` wraps `java.lang.String.stripTrailing`
fx:String.removeTags(?string)	Function	String	String	`fx:String.removeTags` removes the XML tags from the input string
fx:WordUtils.capitalize(?string)	Function	String	String	`fx:WordUtils.capitalize` wraps `org.apache.commons.text.WordUtils.capitalize`
fx:WordUtils.capitalizeFully(?string)	Function	String	String	`fx:WordUtils.capitalizeFully` wraps `org.apache.commons.text.WordUtils.capitalizeFully`
fx:WordUtils.initials(?string)	Function	String	String	`fx:WordUtils.initials` wraps `org.apache.commons.text.WordUtils.initials`
fx:WordUtils.swapCase(?string)	Function	String	String	`fx:WordUtils.swapCase` wraps `org.apache.commons.text.WordUtils.swapCase`
fx:WordUtils.uncapitalize(?string)	Function	String	String	`fx:WordUtils.uncapitalize` wraps `org.apache.commons.text.WordUtils.uncapitalize`
fx:DigestUtils.md2Hex(?string)	Function	String	String	`fx:DigestUtils.md2Hex` wraps `org.apache.commons.codec.digest.DigestUtils.md2Hex`
fx:DigestUtils.md5Hex(?string)	Function	String	String	`fx:DigestUtils.md5Hex` wraps `org.apache.commons.codec.digest.DigestUtils.md5Hex`
fx:DigestUtils.sha1Hex(?string)	Function	String	String	`fx:DigestUtils.sha1Hex` wraps `org.apache.commons.codec.digest.DigestUtils.sha1Hex`
fx:DigestUtils.sha256Hex(?string)	Function	String	String	`fx:DigestUtils.sha256Hex` wraps `org.apache.commons.codec.digest.DigestUtils.sha256Hex`
fx:DigestUtils.sha384Hex(?string)	Function	String	String	`fx:DigestUtils.sha384Hex` wraps `org.apache.commons.codec.digest.DigestUtils.sha384Hex`
fx:DigestUtils.sha512Hex(?string)	Function	String	String	`fx:DigestUtils.sha512Hex` wraps `org.apache.commons.codec.digest.DigestUtils.sha512Hex`
fx:URLEncoder.encode(?string)	Function	String, String	String	`fx:URLEncoder.encode` wraps `java.net.URLEncoder.encode`
fx:URLDecoder.decode(?string)	Function	String, String	String	`fx:URLDecoder.decode` wraps `java.net.URLDecoder.decode`
fx:serial(?a ... ?n)	Function	Any sequence of nodes	Integer	The function `fx:serial (?a ... ?n)` generates an incremental number using the arguments as reference counters. For example, calling `fx:serial("x")` two times will generate `1` and then `2`. Instead, calling `fx:serial(?x)` multiple times will generate sequential numbers for each value of `?x`.
fx:entity(?a ... ?n)	Function	Any sequence of nodes	URI node	The function `fx:entity (?a ... ?n)` accepts a list of arguments and performs concatenation and automatic casting to string. Container membership properties (`rdf:_1`,`rdf:_2`,...) are cast to numbers and then to strings (`"1","2"`).
fx:literal(?a, ?b)	Function	String, (URI or language code)	Literal node	The function `fx:literal( ?a , ?b )` builds a literal from the string representation of `?a`, using `?b` either as the datatype (if an IRI is given) or as the language tag (if a two-character string is given).
fx:bnode(?a)	Function	Any node	Blank node	The function `fx:bnode( ?a)` builds a blank node enforcing the node value as local identifier. This is useful when multiple CONSTRUCT templates are populated with blank nodes generated in different query solutions, but the nodes should be joined in the output RDF graph. The standard `BNODE` function, by contrast, generates a new node for each query solution (see issue #273 for an explanatory case).
fx:LevenshteinDistance(?n1, ?n2)	Function	String, String	Integer	The function `fx:LevenshteinDistance(?n1, ?n2)` computes the Levenshtein Distance between ?n1 and ?n2 (see #182).
fx:CosineDistance(?n1, ?n2)	Function	String, String	Double	The function `fx:CosineDistance(?n1, ?n2)` computes the Cosine Distance between ?n1 and ?n2 (see #182).
fx:JaccardDistance(?n1, ?n2)	Function	String, String	Double	The function `fx:JaccardDistance(?n1, ?n2)` computes the Jaccard Distance between ?n1 and ?n2 (see #182).
fx:JaroWinklerDistance(?n1, ?n2)	Function	String, String	Double	The function `fx:JaroWinklerDistance(?n1, ?n2)` computes the Jaro-Winkler Distance between ?n1 and ?n2 (see #182).
fx:LongestCommonSubsequenceDistance(?n1, ?n2)	Function	Any pair of IRIs or Literals	Integer	The function `fx:LongestCommonSubsequenceDistance(?n1, ?n2)` computes the Longest Common Subsequence Distance between ?n1 and ?n2 (see #182).
fx:HammingDistance(?n1, ?n2)	Function	String, String	Integer	The function `fx:HammingDistance(?n1, ?n2)` computes the Hamming Distance between ?n1 and ?n2 (see #182).
fx:QGramDistance(?n1, ?n2)	Function	String, String	Double	The function `fx:QGramDistance(?n1, ?n2)` computes the QGram Distance between ?n1 and ?n2 (see #394).
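
The behaviour of the container membership functions and of `fx:serial`, as described in the table, can be mimicked in a few lines of Python. This is an illustrative sketch of the documented semantics, not the Java implementation; the Python names (`next_property`, `Serial`, and so on) are assumptions, and error handling for non-membership properties is a guess.

```python
import re
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_CMP = re.compile(re.escape(RDF) + r"_([1-9][0-9]*)$")


def is_container_membership_property(iri):
    return _CMP.match(iri) is not None


def cardinal(iri):
    """rdf:_24 -> 24"""
    match = _CMP.match(iri)
    if match is None:
        raise ValueError(f"not a container membership property: {iri}")
    return int(match.group(1))


def _member(n):
    if n < 1:
        raise ValueError("container membership properties start at rdf:_1")
    return f"{RDF}_{n}"


def before(a, b):
    both = is_container_membership_property(a) and is_container_membership_property(b)
    return both and cardinal(a) < cardinal(b)


def after(a, b):
    return before(b, a)


def forward(a, steps):
    return _member(cardinal(a) + steps)


def backward(a, steps):
    return _member(cardinal(a) - steps)


def next_property(a):
    return forward(a, 1)


def previous(a):
    return backward(a, 1)


class Serial:
    """One counter per distinct tuple of arguments, like fx:serial."""

    def __init__(self):
        self._counters = defaultdict(int)

    def __call__(self, *args):
        self._counters[args] += 1
        return self._counters[args]
```

For instance, a `Serial` instance called twice with `"x"` returns 1 and then 2, while a first call with a different argument starts again from 1.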

Usage

SPARQL Anything is available as a Java library, a Command Line Interface, a Web Application Server, and a Python library.

Command Line Interface (CLI)

An executable JAR can be obtained from the Releases page.

The jar can be executed as follows:

usage: java -jar sparql-anything-<version>.jar -q query [-f <output format>] [-v
            <filepath | name=value> ... ] [-c option=value]  [-l path] [-o
            filepath]
 -q,--query <query>                    The path to the file storing the
                                       query to execute or the query
                                       itself.
 -o,--output <file>                    OPTIONAL - The path to the output
                                       file. [Default: STDOUT]
 -a,--append                           OPTIONAL - Should output to file be
                                       appended? WARNING: this option does
                                       not ensure that the whole file is
                                       valid -- that is up to the user to
                                       set up the conditions (such as
                                       using NQ serialization and not
                                       using blank nodes)
 -e,--explain                          OPTIONAL - Explain query execution
 -l,--load <load>                      OPTIONAL - The path to one RDF file
                                       or a folder including a set of
                                       files to be loaded. When present,
                                       the data is loaded in memory and
                                       the query executed against it.
 -f,--format <string>                  OPTIONAL -  Format of the output
                                       file. Supported values: JSON, XML,
                                       CSV, TEXT, TTL, NT, NQ. [Default:
                                       TEXT or TTL]
 -s,--strategy <strategy>              OPTIONAL - Strategy for query
                                       evaluation. Possible values: '1' -
                                       triple filtering (default), '0' -
                                       triplify all data. The system
                                       fallbacks to '0' when the strategy
                                       is not implemented yet for the
                                       given resource type.
 -p,--output-pattern <outputPattern>   OPTIONAL - Output filename pattern,
                                       e.g. 'my-file-?friendName.json'.
                                       Variables should start with '?' and
                                       refer to bindings from the input
                                       file. This option can only be used
                                       in combination with 'input' and is
                                       ignored otherwise. This option
                                       overrides 'output'.
 -v,--values <values>                  OPTIONAL - Values passed as input
                                       parameter to a query template. When
                                       present, the query is pre-processed
                                       by substituting variable names with
                                       the values provided. The argument
                                       can be used in two ways. (1)
                                       Providing a single SPARQL ResultSet
                                       file. In this case, the query is
                                       executed for each set of bindings
                                       in the input result set. Only 1
                                       file is allowed. (2) Named variable
                                       bindings: the argument value must
                                       follow the syntax:
                                       var_name=var_value. The argument
                                       can be passed multiple times and
                                       the query repeated for each set of
                                       values.
 -c,--configuration <option=value>     OPTIONAL - Configuration to be
                                       passed to the SPARQL Anything
                                       engine (this is equivalent to
                                       define them in the SERVICE IRI).
                                       The argument can be passed multiple
                                       times (one for each option to be
                                       set). Options passed in this way
                                       can be overwritten in the SERVICE
                                       IRI or in the Basic Graph Pattern.
 -i,--input <input>                    [Deprecated] OPTIONAL - The path to
                                       a SPARQL result set file to be used
                                       as input. When present, the query
                                       is pre-processed by substituting
                                       variable names with values from the
                                       bindings provided. The query is
                                       repeated for each set of bindings
                                       in the input result set.
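
The `-p` option above describes filename patterns such as 'my-file-?friendName.json', where variables starting with '?' refer to bindings from the input. A minimal Python sketch of that substitution follows. The function name `expand_output_pattern`, the variable-name syntax accepted by the regular expression, and the choice to leave unknown variables untouched are all assumptions of this sketch.

```python
import re

# Assumption: a variable is '?' followed by letters, digits or underscores.
_VARIABLE = re.compile(r"\?([A-Za-z_][A-Za-z0-9_]*)")


def expand_output_pattern(pattern, binding):
    """Replace each ?variable in the pattern with its value in the binding."""

    def substitute(match):
        name = match.group(1)
        # Variables missing from the binding are left as they are.
        return str(binding[name]) if name in binding else match.group(0)

    return _VARIABLE.sub(substitute, pattern)
```

Applied to each set of bindings in turn, this gives one output filename per query execution.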

Logging (SLF4J) can be configured by adding one of the following JVM options before -jar.

To enable trace-level logging for SPARQL Anything and its dependencies:

-Dorg.slf4j.simpleLogger.defaultLogLevel=trace

To enable trace-level logging for SPARQL Anything only:

-Dorg.slf4j.simpleLogger.log.io.github.sparqlanything=trace

Fuseki

An executable JAR of a SPARQL-Anything-powered Fuseki endpoint can be obtained from the Releases page.

The jar can be executed as follows:

usage: java -jar sparql-anything-server-<version>.jar [-p port] [-e
            sparql-endpoint-path] [-g endpoint-gui-path]
 -e,--path <path>   The path where the server will be running on (Default
                    /sparql.anything).
 -g,--gui <gui>     The path of the SPARQL endpoint GUI (Default /sparql).
 -p,--port <port>   The port where the server will be running on (Default
                    3000 ).

A Docker image is also available; see the project documentation for instructions.

Java Library

SPARQL Anything is available on Maven Central. To use it as a Java library, please follow the instructions in the project documentation.

Extension Mechanisms

You can extend SPARQL Anything by adding new triplifiers; more details can be found in the project documentation.

Python Library

You can use SPARQL Anything as a Python library, see the PySPARQL-Anything project.

Compiling

You can generate the executable JAR files of the command line interface and of the server with Maven:

mvn clean install -Dgenerate-cli-jar=true -Dgenerate-server-jar=true

Licence

SPARQL Anything is distributed under the Apache 2.0 License.

How to cite our work

For citing SPARQL Anything in academic papers please use:

Luigi Asprino, Enrico Daga, Aldo Gangemi, and Paul Mulholland. 2022. Knowledge Graph Construction with a façade: a unified method to access heterogeneous data sources on the Web. ACM Trans. Internet Technol. Just Accepted (2022). https://doi.org/10.1145/3555312

@article{10.1145/3555312,
author = {Asprino, Luigi and Daga, Enrico and Gangemi, Aldo and Mulholland, Paul},
title = {Knowledge Graph Construction with a Fa\c{c}ade: A Unified Method to Access Heterogeneous Data Sources on the Web},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1533-5399},
url = {https://doi.org/10.1145/3555312},
doi = {10.1145/3555312},
abstract = {Data integration is the dominant use case for RDF Knowledge Graphs. However, Web resources come in formats with weak semantics (for example CSV and JSON), or formats specific to a given application (for example BibTex, HTML, and Markdown). To solve this problem, Knowledge Graph Construction (KGC) is gaining momentum due to its focus on supporting users in transforming data into RDF. However, using existing KGC frameworks result in complex data processing pipelines, which mix structural and semantic mappings, whose development and maintenance constitute a significant bottleneck for KG engineers. Such frameworks force users to rely on different tools, sometimes based on heterogeneous languages, for inspecting sources, designing mappings, and generating triples, thus making the process unnecessarily complicated. We argue that it is possible and desirable to equip KG engineers with the ability of interacting with Web data formats by relying on their expertise in RDF and the well-established SPARQL query language [2]. In this article, we study a unified method for data access to heterogeneous data sources with Facade-X, a meta-model implemented in a new data integration system called SPARQL Anything. We demonstrate that our approach is theoretically sound, since it allows a single meta-model, based on RDF, to represent data from (a) any file format expressible in BNF syntax, as well as (b) any relational database. We compare our method to state-of-the-art approaches in terms of usability (cognitive complexity of the mappings) and general performance. Finally, we discuss the benefits and challenges of this novel approach by engaging with the reference user community.},
journal = {ACM Trans. Internet Technol.},
keywords = {RDF, SPARQL, Meta-model, Re-engineering}
}

Conference paper mainly focussing on system requirements:

Daga, Enrico; Asprino, Luigi; Mulholland, Paul and Gangemi, Aldo (2021). Facade-X: An Opinionated Approach to SPARQL Anything. In: Alam, Mehwish; Groth, Paul; de Boer, Victor; Pellegrini, Tassilo and Pandit, Harshvardhan J. eds. Volume 53: Further with Knowledge Graphs, Volume 53. IOS Press, pp. 58–73.

DOI: https://doi.org/10.3233/ssw210035

@incollection{oro78973,
          volume = {53},
           month = {August},
          author = {Enrico Daga and Luigi Asprino and Paul Mulholland and Aldo Gangemi},
       booktitle = {Volume 53: Further with Knowledge Graphs},
          editor = {Mehwish Alam and Paul Groth and Victor de Boer and Tassilo Pellegrini and Harshvardhan J. Pandit},
           title = {Facade-X: An Opinionated Approach to SPARQL Anything},
       publisher = {IOS Press},
            year = {2021},
         journal = {Studies on the Semantic Web},
           pages = {58--73},
        keywords = {SPARQL; meta-model; re-engineering},
             url = {http://oro.open.ac.uk/78973/},
        abstract = {The Semantic Web research community understood since its beginning how crucial it is to equip practitioners with methods to transform non-RDF resources into RDF. Proposals focus on either engineering content transformations or accessing non-RDF resources with SPARQL. Existing solutions require users to learn specific mapping languages (e.g. RML), to know how to query and manipulate a variety of source formats (e.g. XPATH, JSON-Path), or to combine multiple languages (e.g. SPARQL Generate). In this paper, we explore an alternative solution and contribute a general-purpose meta-model for converting non-RDF resources into RDF: \textit{Facade-X}. Our approach can be implemented by overriding the SERVICE operator and does not require to extend the SPARQL syntax. We compare our approach with the state of art methods RML and SPARQL Generate and show how our solution has lower learning demands and cognitive complexity, and it is cheaper to implement and maintain, while having comparable extensibility and efficiency.}
}

sparql.anything's People

Forkers

ghxiao justin2004 aahmadai mwx23 aghoshpro anhlt18vn alexdma mathiasvda kvistgaard emidiostani volland

sparql.anything's Issues

Refactor artefacts and package names

to com.github.sparqlanything.*

JSON:API / Drupal

Any idea on how to integrate data exposed using a JSON:API based API ?

That would help to integrate data contained in Drupal-systems. See https://www.drupal.org/docs/core-modules-and-themes/core-modules/jsonapi-module

Support embedding content to triplify as IRI argument

SELECT ?o {

SERVICE <facade-x:content=Text to triplify with txt triplifier, media-type=application/text> {?s rdf:_1 ?o}

}

This query should return "Text to triplify with txt triplifier".

Cleanup logging framework

Multiple SLF4J bindings, references to log4j12 could be avoided.

Avoid executing the same transformation multiple times when facade-x IRI built from variable bindings

facade-x SERVICE IRIs can be dynamically generated from variable bindings. However, the same variables may be evaluated multiple times with the same value, resulting in repeated calls to the SERVICE clause with the exact same parameters!

The executor may remember service calls already performed and avoid repeating the same operation multiple times.
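The caching idea can be sketched as a lookup table keyed on the fully resolved SERVICE IRI. The snippet is a minimal Python illustration, not the project's Java executor; `MemoizingExecutor` and `fake_triplify` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


class MemoizingExecutor:
    """Caches the result of a transformation per resolved SERVICE IRI (hypothetical helper)."""

    def __init__(self, triplify: Callable[[str], List[Triple]]) -> None:
        self._triplify = triplify
        self._cache: Dict[str, List[Triple]] = {}
        self.calls = 0  # number of real transformations performed

    def execute(self, service_iri: str) -> List[Triple]:
        # Run the transformation only the first time an IRI is seen.
        if service_iri not in self._cache:
            self.calls += 1
            self._cache[service_iri] = self._triplify(service_iri)
        return self._cache[service_iri]


def fake_triplify(iri: str) -> List[Triple]:
    # Stand-in for a real triplifier: returns one triple mentioning the IRI.
    return [("_:root", "rdf:_1", iri)]


executor = MemoizingExecutor(fake_triplify)
for iri in ["facade-x:location=a.csv", "facade-x:location=a.csv", "facade-x:location=b.csv"]:
    executor.execute(iri)
print(executor.calls)  # two distinct IRIs, so two transformations
```

A cache like this assumes the source does not change during query execution; a real implementation would probably scope it to a single query.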

XML triplifier to use default fx: namespace

Instead of using the file location.

Collaborate with Ontop

Please collaborate with Ontop.

See

This was already suggested in #6

Support bibliographic formats e.g. bibtex

#127
Document

Support traversing a local directory

It could be useful to just traverse a local directory and explore its content with sparql.anything

Support Excel and Google spreadsheets

As the title says

SHACL definition of Facade-X

It would be a useful tool for validating the output of sparql.anything transformers.

Improve description of URI schema, removing BNF-like syntax

As the summary says, the documentation currently gives the false impression that the IRI schema shown is written in formal BNF, which it is not.

Handling namespaces for properties and entities

At the moment, the namespace option parameter is used as the default prefix for both schema elements and named entities. This may create problems, for example, when joining two CSVs. Users should declare different namespaces to avoid clashes. An easier way may be to use the root entity (file name) as the default prefix.

Implement blank-node=false for CSV, HTML, and XML

This should be straight-forward for all the currently supported formats.

Audit graph

We can add an option audit=1 to include a graph with information about the generated, queried graphs, using the SPARQL Service Description Vocabulary and VoID. This would be an optional meta-graph, useful for debugging and troubleshooting.

We can use a new boolean option 'audit=1' (defaults off)

Guess output format from output file extension

At the moment, the CLI wants you to specify the output format as a separate parameter:

fx -q titles.rq -o titles.xml -f xml

The following, instead, generates a JSON file (named .xml), because the format falls back to the default (JSON):

fx -q titles.rq -o titles.xml

facade-x IRI not compliant with IETF scheme

Jena complains as follows:

Bad IRI: <facade-x:media-type=text/html,html.selector=#az-group,location=https://imma.ie/artists/> Code: 45/UNREGISTERED_NONIETF_SCHEME_TREE in SCHEME: The scheme name has a "-" in it, but it does not start in "x-" and the prefix is not known as the prefix of an alternative tree for URI schemes.

We may change the URI scheme; the discussion on alternatives is open.
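The rule quoted in the Jena message can be illustrated with a small check. This is a simplification written for this page, not Jena's implementation: it ignores the "alternative tree" prefixes that the message also mentions, and `violates_hyphen_rule` is a hypothetical name.

```python
def violates_hyphen_rule(iri: str) -> bool:
    """True if the scheme contains '-' but does not start with 'x-' (simplified rule)."""
    scheme = iri.partition(":")[0]
    return "-" in scheme and not scheme.startswith("x-")


print(violates_hyphen_rule("facade-x:media-type=text/html"))      # scheme 'facade-x' is flagged
print(violates_hyphen_rule("x-sparql-anything:location=a.csv"))   # scheme starting with 'x-' passes
```

Under this reading, a scheme such as `x-sparql-anything` (the one used by the SERVICE clause in the README) avoids the warning.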

Add CLI parameter 'strategy'

We are developing alternative approaches to query execution. These should also be available as options in the CLI.

New parameter: load

This is useful to chain the output of one query into another. RDF files can be loaded and queried along with the output of the service clause(s).

CLI: default format depending on query type or output file extension

ASK / SELECT: application/sparql-results+json
CONSTRUCT: text/turtle
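One possible resolution order is: explicit `-f` value first, then the output file extension, then a default based on the query type. The sketch below is illustrative Python, not the CLI's code. The extension table is an assumption, and `choose_format` is a hypothetical name. The query-type defaults are the ones listed in this issue.

```python
import os
from typing import Optional

# Assumed mapping from file extension to CLI format names.
EXTENSION_FORMATS = {
    ".json": "JSON", ".xml": "XML", ".csv": "CSV",
    ".txt": "TEXT", ".ttl": "TTL", ".nt": "NT",
}

# Defaults proposed in the issue, keyed by query type.
QUERY_TYPE_DEFAULTS = {
    "ASK": "application/sparql-results+json",
    "SELECT": "application/sparql-results+json",
    "CONSTRUCT": "text/turtle",
}


def choose_format(query_type: str, output_path: Optional[str] = None,
                  explicit: Optional[str] = None) -> str:
    """Pick an output format: explicit flag, then file extension, then query type."""
    if explicit:
        return explicit
    if output_path:
        ext = os.path.splitext(output_path)[1].lower()
        if ext in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[ext]
    return QUERY_TYPE_DEFAULTS[query_type.upper()]


print(choose_format("SELECT", "titles.xml"))  # extension wins over the JSON default
print(choose_format("CONSTRUCT"))             # no hints, so the query type decides
```

The sketch mixes short format names and media types because the issues themselves use both; a real implementation would normalise them to one vocabulary.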

java -jar issue: constructing graphs without -f parameter leads to an error

small issue: the help says TTL is the default but without -f TTL I get this exception:

Exception in thread "main" org.apache.jena.riot.RiotException: No graph writer for 'Lang:RDF/XML'

Triple pattern filtering

Currently, triplifiers transform all the data before executing the query. We could use the query expression and extract triple patterns to limit the triples added to the model. The resulting graph could be a subset of the full graph, including all the triples useful to evaluate the query. This approach is easy to implement and may improve performance in some (but not all) cases.
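A minimal sketch of the filtering step, assuming triples are plain tuples and a triple pattern uses `None` where the query has a variable. The names `matches` and `filter_triples` are hypothetical; the actual triplifiers are written in Java.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]
# None plays the role of a SPARQL variable in a pattern.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """A triple matches when every non-variable position is equal."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def filter_triples(triples: Iterable[Triple], patterns: Iterable[Pattern]) -> Iterator[Triple]:
    """Keep only the triples that match at least one pattern of the query."""
    pattern_list = list(patterns)
    for triple in triples:
        if any(matches(triple, p) for p in pattern_list):
            yield triple


triples = [
    ("_:root", "rdf:_1", "_:row1"),
    ("_:row1", "xyz:name", "Laura"),
    ("_:row1", "xyz:id", "2070"),
]
kept = list(filter_triples(triples, [(None, "xyz:name", None)]))
print(kept)  # only the xyz:name triple is added to the model
```

A pattern made only of variables matches everything, which is one of the cases where this approach brings no benefit.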

Default facade-x ns uri

Currently we are using urn:facade-x:ns# but it would be better to use a public one, e.g. http://sparql.xyz/facade-x/ns/

Include simple logger in cli artefact

We can use slf4j:slf4j-simple

Support file archives (zip, tar, ...)

It would be useful to be able to traverse the content of file archives.

Relative location seems not working properly for paths like "file.ext"

[XML] Reuse declared namespaces when building element types and attribute names (and make sure they end with # or /)

As the summary says, at the moment, declared namespaces are ignored.

Version

We should use -SNAPSHOT versioning for builds that are not meant to produce releases.

For an explanation see https://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

Facade-X options as basic graph patterns

It could be useful to support facade-x configuration from within the SERVICE query with magic properties.

CLI option --output not working

The CLI ignores the parameter and prints to STDOUT.

CLI: support multiple output formats

The CLI should pick the right serialisation format among the following: JSON, XML, CSV, TEXT, TTL, NT.

In the future, we may reuse the io.github.basilapi:rendering library which is based on mime types as format identifiers and supports a large variety of formats.

HTML profiles: DOM, Microdata, RDFa, ...

HTML can be a data source in many different ways. Currently, we support a CSS-selector and generate a Facade-X representation of the related DOM portions (each solution in a separate graph). We should consider also alternative approaches, for example, referring to well-known approaches to embed data, such as Microdata or RDFa.

[html] incorrect placement of text nodes

Currently, text nodes are misplaced, the content being appended to the last child item.

New parameter: input

This feature will allow chaining the output of a previous query to a new one.

When the input is a sparql result set, it will be used as a source of query parameters, following the BASIL convention.

Support blank space in option values

[html] Include dom components innerText and dom innerHTML

It would be very useful in addition to the HTML source data, to also have values from those two elements. However, they should not be represented with the same namespace we use for the content (xhtml:). One option is to use the location of the DOM specification: https://dom.spec.whatwg.org/#.

RDF files as resources

It could be handy to allow users to SPARQL static RDF files using the SERVICE operator. I think SPARQL.anything should include this feature although it is not related to Facade-X so probably we don't want to use the same protocol handler?

Performance: slicing

A simple way of improving usability with large files is to support a general option to 'slice' content in some ways. Some ideas:

by indicating the number of containers to be produced, or the maximum number of values
by indicating the top N slots, or the bottom N
by indicating a limit in the triples produced.

We may think of others.
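Two of these ideas (top N slots, and a cap on the triples produced) can be sketched with a lazy generator. This is an illustrative Python sketch over an in-memory table, not the project's implementation; `rows_to_triples`, `top_slots` and `limit_triples` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(rows: Iterable[List[str]]) -> Iterator[Triple]:
    """Lazily emit Facade-X-like triples: one container per row, one slot per cell."""
    for i, row in enumerate(rows, start=1):
        container = f"_:row{i}"
        yield ("_:root", f"rdf:_{i}", container)
        for j, value in enumerate(row, start=1):
            yield (container, f"rdf:_{j}", value)


def top_slots(rows: Iterable[List[str]], n: int) -> Iterator[List[str]]:
    """Keep only the first n slots of the root container."""
    return islice(rows, n)


def limit_triples(triples: Iterable[Triple], limit: int) -> Iterator[Triple]:
    """Stop after producing `limit` triples."""
    return islice(triples, limit)


rows = [["a", "b"], ["c", "d"], ["e", "f"]]
first_two = list(rows_to_triples(top_slots(rows, 2)))
capped = list(limit_triples(rows_to_triples(rows), 4))
print(len(first_two), len(capped))
```

Because everything is a generator, the source is read only as far as the slice requires, which is the point for large files.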

Support triplifications of HTTP Headers when resolving HTTP URLs

automatic detection mime-type

https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType%28java.nio.file.Path%29 ??

[html] Invalid local name

An exception occurs with details:

Exception in thread "main" org.apache.jena.shared.InvalidPropertyURIException: http://www.w3.org/1999/xhtml#http:
	at org.apache.jena.rdf.model.impl.PropertyImpl.checkLocalName(PropertyImpl.java:66)
	at org.apache.jena.rdf.model.impl.PropertyImpl.<init>(PropertyImpl.java:55)
	at org.apache.jena.rdf.model.ResourceFactory$Impl.createProperty(ResourceFactory.java:296)
	at org.apache.jena.rdf.model.ResourceFactory.createProperty(ResourceFactory.java:144)

Exception when reading remote CSV

It seems that CSV Triplifier has a problem when reading remote resources and throws an exception here.
The issue can be reproduced with the query:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT *
WHERE {

    SERVICE <x-sparql-anything:csv.headers=true,location=https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-andamento-nazionale/dpc-covid19-ita-andamento-nazionale-20200409.csv> {
        ?s ?p ?o
    }

}

Consistent parameter names in URIs

I suggest using location, namespace (applied to domain specific properties/types), media-type for general-purpose parameters and specify type-specific ones using, for example, csv.format=DEFAULT or json.root ...
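The naming convention can be illustrated by a parser that separates general-purpose options from format-specific ones. This is a simplified Python sketch: it assumes options are separated by commas and that values contain no commas, and `parse_options` is a hypothetical name.

```python
from typing import Dict, Tuple


def parse_options(service_iri: str) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]:
    """Split 'scheme:key=value,...' into general options and per-format options."""
    _, _, body = service_iri.partition(":")
    general: Dict[str, str] = {}
    specific: Dict[str, Dict[str, str]] = {}
    for pair in body.split(","):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        if "." in key:
            # Keys such as 'csv.headers' belong to one format.
            fmt, _, name = key.partition(".")
            specific.setdefault(fmt, {})[name] = value
        else:
            # Keys such as 'location' or 'media-type' apply to any format.
            general[key] = value
    return general, specific


general, specific = parse_options(
    "x-sparql-anything:csv.headers=true,location=https://example.org/data.csv"
)
print(general)
print(specific)
```

The URL `https://example.org/data.csv` is a placeholder, not a resource mentioned in the issue.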

Support CSV first row as header

If my CSV has a first row as header, as in the example:

email,id,first name,last name
[email protected],2070,Laura,Grey
[email protected],4081,Craig,Johnson
[email protected],9346,Mary,Jenkins
[email protected],5079,Jamie,Smith

could I supply an option for the headers to provide additional property bindings (escaping spaces and all) instead of rdf:_x?

Based on the code at CSVTriplifier.java, I tried to pass -Dcsv.headers=true, but it doesn't seem to be picked up by the way the engine passes Java properties to the triplifier.
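A sketch of the requested behaviour, written in Python for brevity: when headers are enabled, the first row provides property names instead of `rdf:_x`. Percent-encoding the header text and using the `http://sparql.xyz/facade-x/data/` namespace for the properties are assumptions of this example, and `csv_to_triples` is a hypothetical name.

```python
import csv
import io
from typing import List, Tuple
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XYZ = "http://sparql.xyz/facade-x/data/"  # assumed namespace for header-based properties

Triple = Tuple[str, str, str]


def csv_to_triples(text: str, headers: bool = False) -> List[Triple]:
    """Turn CSV text into container/slot triples, optionally naming slots by header."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows.pop(0) if headers and rows else []
    triples: List[Triple] = []
    for i, row in enumerate(rows, start=1):
        container = f"_:row{i}"
        triples.append(("_:root", f"{RDF}_{i}", container))
        for j, value in enumerate(row, start=1):
            if j <= len(names):
                # Escape spaces and other unsafe characters in the header.
                prop = XYZ + quote(names[j - 1].strip(), safe="")
            else:
                prop = f"{RDF}_{j}"
            triples.append((container, prop, value))
    return triples


sample = "id,first name,last name\n2070,Laura,Grey\n"
for triple in csv_to_triples(sample, headers=True):
    print(triple)
```

Cells beyond the last header fall back to `rdf:_x`, so rows longer than the header row still produce triples.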

Use input parameters to customise the output filename

When iterating over input parameters and generating multiple files, it would be useful to customise the output filename.
This is useful for creating a collection of RDF files each one identified nicely from the parameters.

For example, having an input including tuples such as

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="artistUrl"/>
    <variable name="artistNickname"/>
  </head>
  <results>
    <result>
      <binding name="artistUrl">
        <literal>https://imma.ie/artists/william-leech/</literal>
      </binding>
      <binding name="artistNickname">
        <literal>leech-william</literal>
      </binding>
    </result>
    <result>
      <binding name="artistUrl">
        <literal>https://imma.ie/artists/marie-foley/</literal>
      </binding>
      <binding name="artistNickname">
        <literal>foley-marie</literal>
      </binding>
    </result>
  </results>
</sparql>

being able to instruct the tool to use artistNickname as (part of the) output filename.
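One way to picture the feature is a filename pattern filled in from each result row. The sketch below is illustrative Python: the `{artistNickname}.ttl` pattern syntax and the function names are assumptions, not an existing CLI option.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

RESULTS_XML = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="artistUrl"/><variable name="artistNickname"/></head>
  <results>
    <result>
      <binding name="artistUrl"><literal>https://imma.ie/artists/william-leech/</literal></binding>
      <binding name="artistNickname"><literal>leech-william</literal></binding>
    </result>
    <result>
      <binding name="artistUrl"><literal>https://imma.ie/artists/marie-foley/</literal></binding>
      <binding name="artistNickname"><literal>foley-marie</literal></binding>
    </result>
  </results>
</sparql>"""


def bindings(xml_text: str) -> Iterator[Dict[str, str]]:
    """Yield one {variable: value} dictionary per result row."""
    root = ET.fromstring(xml_text)
    for result in root.findall("sr:results/sr:result", NS):
        row: Dict[str, str] = {}
        for binding in result.findall("sr:binding", NS):
            value = binding.find("*")  # first child: literal, uri or bnode
            row[binding.get("name")] = (value.text or "") if value is not None else ""
        yield row


def output_names(xml_text: str, pattern: str) -> List[str]:
    """Fill the (hypothetical) filename pattern with the values of each row."""
    return [pattern.format(**row) for row in bindings(xml_text)]


names = output_names(RESULTS_XML, "{artistNickname}.ttl")
print(names)
```

Values used in filenames would need sanitising in practice (for example, slashes in a URL), which this sketch does not attempt.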
