atomgraph / csv2rdf

Streaming, transforming, SPARQL-based CSV to RDF converter. Apache license.

Home Page: https://hub.docker.com/r/atomgraph/csv2rdf

License: Apache License 2.0

csv2rdf's Introduction

CSV2RDF

Streaming, transforming CSV to RDF converter

Reads CSV/TSV data as generic CSV/RDF, transforms each row using SPARQL CONSTRUCT or DESCRIBE, and streams the output triples. The generic CSV/RDF format is based on the minimal mode of Generating RDF from Tabular Data on the Web.

This transformation-based approach enables:

  • building resource URIs on the fly
  • fixing/remapping datatypes
  • mapping different groups of values to different RDF structures

CSV2RDF differs from tarql in how mapping queries use graph patterns in the WHERE clause. tarql queries operate on a table of bindings (provided as an implicit VALUES block) in which CSV column names become variable names. CSV2RDF instead generates an intermediary RDF graph for each CSV row (using column names as relative-URI properties), which the WHERE patterns explicitly match against.
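
As a rough illustration of the intermediary per-row graph, the Python sketch below turns each CSV row into N-Triples-style lines. It assumes a blank-node subject per row, plain string literals for every value, skipped empty cells, and a property namespace formed by appending # to the base URI. The function names and the exact serialization are hypothetical and are not taken from the CSV2RDF source.

```python
import csv
import io
from urllib.parse import quote


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def row_to_triples(header, row, base, row_id):
    """Return one N-Triples line per non-empty cell of a CSV row.

    Assumptions (not confirmed by the CSV2RDF source): the row is a blank
    node, every value is a plain string literal, empty cells are skipped.
    """
    subject = f"_:row{row_id}"
    triples = []
    for column, value in zip(header, row):
        if value == "":
            continue
        prop = f"<{base}#{quote(column)}>"  # property namespace = base + "#"
        triples.append(f'{subject} {prop} "{escape_literal(value)}" .')
    return triples


def csv_to_triples(text, base, delimiter=","):
    """Process rows one at a time, yielding the small graph for each row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    for row_id, row in enumerate(reader, start=1):
        yield row_to_triples(header, row, base, row_id)


if __name__ == "__main__":
    sample = "FID,name,postcode\np_hus.0,Adelgade 5 p_hus.0,1304\n"
    for graph in csv_to_triples(sample, "https://localhost/"):
        print("\n".join(graph))
```

A WHERE pattern such as ?parkingRow <#FID> ?id matches triples of this shape because <#FID> resolves against the base URI of the query.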

Build

mvn clean install

This should produce an executable JAR file, target/csv2rdf-2.0.0-jar-with-dependencies.jar, which bundles the dependency libraries.

Usage

CSV data is read from stdin; the resulting RDF data is written to stdout.

CSV2RDF is available as a .jar as well as a Docker image atomgraph/csv2rdf (recommended).

Parameters:

  • query-file - a text file containing a SPARQL 1.1 CONSTRUCT query
  • base - the base URI for the data (also becomes the BASE URI of the SPARQL query). The property namespace is constructed by appending # to the base URI.

Options:

  • -d, --delimiter - value delimiter character, by default ,.
  • --max-chars-per-column - max characters per column value, by default 4096
  • --input-charset - CSV input encoding, by default UTF-8
  • --output-charset - RDF output encoding, by default UTF-8

Note that delimiters may have a special meaning in the shell. Therefore, always enclose them in single quotes (e.g. ';') when running CSV2RDF from a shell.

If you want to retrieve the raw CSV/RDF output, use the identity transform query CONSTRUCT WHERE { ?s ?p ?o }.

Example

CSV data in parking-facilities.csv:

postDistrict,roadCode,houseNumber,name,FID,long,lat,address,postcode,parkingSpace,owner,parkingType,information
1304 København K,24,5,Adelgade 5 p_hus.0,p_hus.0,12.58228733,55.68268042,Adelgade 5,1304,92,Privat,P-Kælder,"Adelgade 5-7, Q-park."

CONSTRUCT query in parking-facilities.rq:

PREFIX schema:     <https://schema.org/> 
PREFIX geo:        <http://www.w3.org/2003/01/geo/wgs84_pos#> 
PREFIX xsd:        <http://www.w3.org/2001/XMLSchema#> 
PREFIX rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT
{
    ?parking a schema:ParkingFacility ;
        geo:lat ?lat ;
        geo:long ?long ;
        schema:name ?name ;
        schema:streetAddress ?address ;
        schema:postalCode ?postcode ;
        schema:maximumAttendeeCapacity ?spaces ;
        schema:additionalProperty ?parkingType ;
        schema:comment ?information ;
        schema:identifier ?id .
}
WHERE
{
    ?parkingRow <#FID> ?id ;
        <#name> ?name ;
        <#address> ?address ;
        <#lat> ?lat_string ;
        <#postcode> ?postcode ;
        <#parkingSpace> ?spaces_string ;
        <#parkingType> ?parkingType ;
        <#information> ?information ;
        <#long> ?long_string . 

    BIND(URI(CONCAT(STR(<>), ?id)) AS ?parking) # building URI from base URI and ID
    BIND(xsd:integer(?spaces_string) AS ?spaces)
    BIND(xsd:float(?lat_string) AS ?lat)
    BIND(xsd:float(?long_string) AS ?long)
}

Java execution from shell:

cat parking-facilities.csv | java -jar csv2rdf-2.0.0-jar-with-dependencies.jar parking-facilities.rq https://localhost/ > parking-facilities.ttl

Alternatively, Docker execution from shell:

cat parking-facilities.csv | docker run --rm -i -a stdin -a stdout -a stderr -v "$(pwd)/parking-facilities.rq":/tmp/parking-facilities.rq atomgraph/csv2rdf /tmp/parking-facilities.rq https://localhost/ > parking-facilities.ttl

Note that when using Docker you need to:

  • bind stdin/stdout/stderr streams
  • mount the query file to the container, and use the filepath from within the container as query-file
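
The JAR and Docker invocations can also be assembled programmatically. The Python sketch below builds the argument list for either variant and pipes a CSV file through it. Because the arguments are passed as a list, no shell is involved and a delimiter such as ; needs no quoting. The helper names build_command and convert are hypothetical, and placing options before the two positional parameters is an assumption about the CLI.

```python
import os
import subprocess

JAR = "csv2rdf-2.0.0-jar-with-dependencies.jar"  # file name from the Build section
IMAGE = "atomgraph/csv2rdf"


def build_command(query_file, base, delimiter=None, docker=False):
    """Return the argv list for running CSV2RDF via the JAR or via Docker."""
    # Assumption: options may precede the two positional parameters.
    options = ["--delimiter", delimiter] if delimiter else []
    if not docker:
        return ["java", "-jar", JAR, *options, query_file, base]
    # Inside the container the query is addressed by its mounted path.
    mounted = "/tmp/" + os.path.basename(query_file)
    return [
        "docker", "run", "--rm", "-i",
        "-a", "stdin", "-a", "stdout", "-a", "stderr",
        "-v", f"{os.path.abspath(query_file)}:{mounted}",
        IMAGE, *options, mounted, base,
    ]


def convert(csv_path, out_path, command):
    """Feed csv_path to the command's stdin and write its stdout to out_path."""
    with open(csv_path, "rb") as src, open(out_path, "wb") as dst:
        subprocess.run(command, stdin=src, stdout=dst, check=True)


if __name__ == "__main__":
    print(build_command("parking-facilities.rq", "https://localhost/", delimiter=";"))
```

Calling convert("parking-facilities.csv", "parking-facilities.ttl", build_command("parking-facilities.rq", "https://localhost/", docker=True)) would then mirror the Docker command line shown above, provided Docker and the image are available.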

Output in parking-facilities.ttl:

<https://localhost/p_hus.0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/ParkingFacility> .
<https://localhost/p_hus.0> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "12.58228733"^^<http://www.w3.org/2001/XMLSchema#float> .
<https://localhost/p_hus.0> <https://schema.org/identifier> "p_hus.0" .
<https://localhost/p_hus.0> <https://schema.org/additionalProperty> "P-Kælder" .
<https://localhost/p_hus.0> <https://schema.org/comment> "Adelgade 5-7, Q-park." .
<https://localhost/p_hus.0> <https://schema.org/postalCode> "1304" .
<https://localhost/p_hus.0> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "55.68268042"^^<http://www.w3.org/2001/XMLSchema#float> .
<https://localhost/p_hus.0> <https://schema.org/streetAddress> "Adelgade 5" .
<https://localhost/p_hus.0> <https://schema.org/name> "Adelgade 5 p_hus.0" .
<https://localhost/p_hus.0> <https://schema.org/maximumAttendeeCapacity> "92"^^<http://www.w3.org/2001/XMLSchema#integer> .

Query examples

More mapping query examples can be found under LinkedDataHub's northwind-traders demo app.

Performance

Largest dataset tested so far: 2.8 GB / 3709725 rows of CSV to 21.7 GB / 151348939 triples in under 27 minutes. Hardware: x64 Windows 10 PC with Intel Core i5-7200U 2.5 GHz CPU and 16 GB RAM.
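
These figures imply rough lower bounds on throughput, since the run finished in under 27 minutes. The short Python calculation below derives them. It treats the full 27 minutes as the elapsed time, which is an assumption that slightly understates the real rates.

```python
ROWS = 3_709_725
TRIPLES = 151_348_939
SECONDS = 27 * 60  # upper bound on elapsed time ("under 27 minutes")

rows_per_second = ROWS / SECONDS
triples_per_second = TRIPLES / SECONDS
triples_per_row = TRIPLES / ROWS

print(f"at least {rows_per_second:,.0f} rows/s")
print(f"at least {triples_per_second:,.0f} triples/s")
print(f"about {triples_per_row:.1f} triples per row")
```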

Dependencies

csv2rdf's People

Contributors

namedgraph

csv2rdf's Issues

How to convert CSV into RDF with both value and unit information?

Does CSV2RDF support using unit ontologies?

CSV data often has associated units, such as kilograms and grams, or seconds, microseconds and nanoseconds.

For example, an item with a weight of 2.4 kilograms:

exproduct:item10245 exterms:weight [ rdf:value "2.4"^^xsd:decimal ; exterms:units exunits:kilograms ] .

CLI to support named arguments

Multiple optional arguments really ask for a command line interface with named arguments, e.g.

--baseURI https://localhost/ --queryFile parking-facilities.rq --delimiter ';' --maxCharsPerColumn 5000

Warnings are printed to stdout

Warnings from Jena, for instance when using an unknown function, are printed to stdout, which corrupts the RDF output. They should go to stderr, or maybe there could be a way to suppress them?

Throw exception when columns are not found

When you refer to a nonexistent column, the tool fails silently and no output is produced. It can take a while to notice that this is the problem.

Would it be an idea to throw a suitable exception for this, and add a flag to suppress these errors if desired?
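
Until the tool reports this itself, one possible workaround is to check the query against the CSV header before running the conversion. The Python sketch below is such a pre-flight check and not a CSV2RDF feature. It assumes columns are referenced as relative IRIs of the form <#column>, as in the example query, and the function names are hypothetical. The regular expression is deliberately naive and would also match such references inside SPARQL comments or string literals.

```python
import csv
import io
import re
from urllib.parse import unquote

# Matches relative property IRIs such as <#FID>; assumes this is how the
# mapping query refers to CSV columns.
COLUMN_REF = re.compile(r"<#([^<>\s]+)>")


def referenced_columns(query):
    """Return the set of column names a query refers to via <#name>."""
    return {unquote(name) for name in COLUMN_REF.findall(query)}


def missing_columns(query, csv_text, delimiter=","):
    """Return referenced columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    return sorted(referenced_columns(query) - set(header))


if __name__ == "__main__":
    # "nmae" is a deliberate typo to trigger the check.
    query = "CONSTRUCT WHERE { ?row <#FID> ?id ; <#nmae> ?name }"
    missing = missing_columns(query, "FID,name\np_hus.0,Adelgade 5\n")
    if missing:
        print(f"columns not found in CSV header: {missing}")
```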