swirrl / csv2rdf

Clojure library and command line application for converting CSV to RDF. An implementation of the W3C CSVW specifications

License: Eclipse Public License 1.0

Clojure 2.64% Ruby 0.10% HTML 96.11% CSS 0.09% JavaScript 0.49% Shell 0.03% Python 0.03% Vim Snippet 0.39% Haml 0.12%
clojure csv csvw linked-data rdf

csv2rdf's People

Contributors

callum-oakley, github-actions[bot], lkitching, rickmoynihan, robsteranium, scottlowe


csv2rdf's Issues

Resolve file URIs locally instead of over HTTP

All JSON URIs are resolved by attempting to access them over HTTP. File URIs should be resolved by reading the referenced file instead. This can be reproduced with the following invocation from the root of the repository:

 java -jar target/csv2rdf-0.4.5-SNAPSHOT-standalone.jar \
    -u w3c-csvw/tests/test034/csv-metadata.json \
    -m annotated -o output.ttl

Allow `-t` argument to override the metadata's url property

At the moment, the user-supplied metadata must provide a "url" property that is used to locate the tabular data. It would be nice to instead be able to specify both the tabular data and the metadata as arguments:

csv2rdf -t data.csv -u metadata.json

Thus, the "url" property would become "data.csv".

This would mean that you wouldn't need to know the location of the tabular data when the metadata is created, allowing you to reuse a metadata schema across multiple CSV tables.

In these cases it would be nice to be able to omit this property from the metadata (making it clear from the JSON that the table still needs to be specified) rather than having to provide a superfluous value that is then overridden.

Currently, if the "url" property is missing, we get the exception:

Expected top-level of metadata document to describe a table or table group

It would be nice if the error message provided a hint as to how to resolve this, e.g.

No table or table group provided. You can specify this with the "url" property or "-t" command-line argument.

Add graph CLI option to allow quads to be output

Add an optional --graph command-line option which, when specified, causes quads to be output in the corresponding graph. Also validate that quad output formats can only be used when a graph is provided.

Potential GC problem through use of gensym to generate blank nodes

When we generate blank nodes for various annotation data, we use Clojure's gensym to generate names that are unique within the process's execution.

gensym interns symbol-name strings, and large inputs produce at least one bnode per row. Interning these strings makes them less eligible for garbage collection, likely resulting in larger GC pauses etc.

There may be arguments for interning more commonly used URIs, such as predicates and classes, to help reduce the memory footprint, but interning blank node names will, if anything, have the opposite effect.

See also posts such as this: https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
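One possible fix (a sketch only; the names here are illustrative, not csv2rdf's actual internals) is to derive blank node labels from a plain counter, avoiding symbol interning entirely:

```clojure
;; Sketch: blank node labels from an atomic counter instead of gensym,
;; so no symbol-name strings are interned.
(def ^:private bnode-counter (java.util.concurrent.atomic.AtomicLong. 0))

(defn fresh-bnode-label []
  (str "bnode" (.getAndIncrement bnode-counter)))
```

Like gensym, this stays unique within a single process execution, which is all blank node labels require.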

Generate separate tests for RDF and validation tests

The CSVW test cases described in manifest.csv describe multiple test variants, e.g. RDF, JSON and validation tests. The test cases in csvw_test.clj generated from manifest.csv currently combine assertions for both RDF and validation tests. Validation is not currently supported, so update the generator to produce RDF tests only.

Merge embedded and user table schemas

Any columns defined in the embedded schema but not in the user-provided metadata should be added to the schema used for parsing the row data. See section 8.10.4.5.1.1 of the tabular specification.

Failure to fetch HTTPS URIs

Hi,

I just noticed that using https URIs to reference remote tables/schemas fails with:

#error {
 :cause No method in multimethod 'request-uri-input-stream' for dispatch value: :https

My workaround is to use the :default dispatch value in the request-uri-input-stream multimethod:

diff --git a/src/csv2rdf/source.clj b/src/csv2rdf/source.clj
index 020df2c..1fa1034 100644
--- a/src/csv2rdf/source.clj
+++ b/src/csv2rdf/source.clj
@@ -58,7 +58,7 @@
 
 (defmulti request-uri-input-stream (fn [^URI uri] (keyword (.getScheme uri))))
 
-(defmethod request-uri-input-stream :http [uri]
+(defmethod request-uri-input-stream :default [uri]
   (let [{:keys [status headers body] :as response} (http/get-uri uri)]
     (if (http/is-not-found-response? response)
       (throw (ex-info

Support parallelising processing

We could support a -p N parallelism flag that runs the transformation across N threads, hopefully cutting processing time drastically.

This should be relatively straightforward by:

  1. Inspect the dialect data and derive the end-of-line tokens etc. from it
  2. Look at the file length in bytes
  3. Crudely split the file into N equal portions
  4. Refine the split offsets slightly by scanning from there to the next true end of line
  5. Wind N streams to their appropriate offsets
  6. Read the header row and give it to each thread
  7. Pass each stream to one of N threads
  8. Have each thread output to a separate RDF file (appropriately numbered)
  9. Optionally, if asked, support a concat flag that re-concatenates the files together

The key to making this fast is to avoid parsing the whole CSV into batches in the splitting step. Any final concat should also just do so at the file level without any parsing of RDF.
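Steps 3 and 4 above could be sketched as follows (illustrative only; this assumes \n line endings and no embedded newlines in quoted cells, which a real implementation would need to derive from the dialect):

```clojure
;; Sketch of steps 3-4: compute N-1 crude byte offsets, then nudge each
;; forward to just past the next newline so no row is split in half.
(defn split-offsets [^java.io.File f n]
  (let [len   (.length f)
        crude (map #(quot (* % len) n) (range 1 n))]
    (with-open [raf (java.io.RandomAccessFile. f "r")]
      (doall
       (for [off crude]
         (do (.seek raf off)
             (loop [pos off]
               (let [b (.read raf)]
                 (cond
                   (neg? b)             len        ;; hit end of file
                   (= b (int \newline)) (inc pos)  ;; split just after EOL
                   :else                (recur (inc pos))))))))))) 
```

Each returned offset is then the start position for one worker's stream.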

Return code / system.exit()

Any chance you could return an error code on exception? I'm always surprised that the JVM doesn't do this automatically, but if an unhandled exception is raised, the return code defaults to 0 (success).
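A minimal sketch of the requested behaviour (inner-main here is a placeholder for the real entry point, not an existing function):

```clojure
;; Sketch: exit non-zero on any unhandled exception.
(defn -main [& args]
  (try
    (apply inner-main args)
    (System/exit 0)
    (catch Throwable t
      (binding [*out* *err*]
        (println (.getMessage t)))
      (System/exit 1))))
```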

Integers turned into longs

When I add extra metadata in -metadata.json, I expect the metadata to be parsed as per the JSON-LD rules and turned into RDF, which for the most part appears to work.

However, for simple numerical values as in the attachment (see "qb:order": 1 etc.), I would expect the resulting RDF to be qb:order 1 with the default datatype of xsd:int, whereas csv2rdf outputs qb:order "1"^^xsd:long.

example.zip

liberal-mapcat holds onto generated sequences

Attempting to call csv->rdf->destination with a tableGroup referencing a table with a large number of generated statements results in a space leak since liberal-mapcat is holding onto the generated subsequences.

Check specification for handling of cells that begin with triple quotes

When the dialect doubleQuote option is true, quotes are escaped by repeating the quote character, e.g. some ""quoted"" text is parsed as some "quoted" text. Cells containing the delimiter character must be quoted, and the quote must be the first and last character of the cell content, e.g.

"Cell containing, the delimiter"

It would be expected that quoted cells that begin with an escaped quote should use triple quotes e.g.

"""Quoted text"", followed by rest"

and that this should be parsed as an opening quote for the cell value followed by an escaped quote. The current implementation treats it as an escaped quote followed by a quote, and subsequently errors because the second quote is not the first character in the cell. Check the specification and the current implementation.

Upgrade to grafter-2 and RDF4j

This library currently depends on an old version of grafter (0.11.4). It would be good to upgrade this to the latest grafter 2.0.2, with the RDF4j backend.

We should probably look at doing this in a phased approach:

  1. Upgrade the grafter dependency to 2.0.2
  2. Explicitly add the transitive Sesame dependency to csv2rdf (as grafter-2 no longer includes it). Also add :exclusions for RDF4j.
  3. Leave the grafter 1 requires as-is
  4. Run the tests etc. and fix up any errors
  5. Cut a release that uses grafter 2 via the grafter 1 namespaces with the Sesame dependency
  6. Remove the :exclusions from the grafter-2 dep
  7. Port the grafter namespaces to grafter-2
  8. Make all tests etc. pass
  9. Release a new grafter-2 version of csv2rdf

Create command-line interface

Create a command line interface for invoking the CSV->RDF process. The options should include:

  • Specifying the tabular or metadata file
  • Which CSVW mode to use (default: standard)
  • An optional output file to write the results (default: standard output)

Outputting URIs using separator argument doesn't give multiple triples

When using the separator argument to attempt to output multiple triples whose objects are URIs, I don't get the expected behaviour.

example.csv:

name,knows
Ross,"http://example.org/Alice*http://example.org/Bob*http://example.org/Charlie"

example.csv-metadata.json:

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "example.csv",
    "tableSchema": {
        "columns": [
            {
                "titles": "name",
                "name": "name",
                "suppressOutput": true
            },
            {
                "titles": "knows",
                "name": "knows",
                "separator": "*",
                "propertyUrl": "foaf:knows",
                "valueUrl": "{knows}"
            }
        ],
        "aboutUrl": "https://example.org/{name}"
    }
}

Expected output

<https://example.org/Ross> <http://xmlns.com/foaf/0.1/knows>
    <http://example.org/Alice>, <http://example.org/Bob>, <http://example.org/Charlie> .

Actual output

<https://example.org/Ross> <http://xmlns.com/foaf/0.1/knows>
    <http://example.org/Alice,http://example.org/Bob,http://example.org/Charlie> .

Note that the * separator has been converted to a ,. I've tried a few different separator characters and the behaviour remains the same, so I don't believe it's caused by my specific choice of *.

When removing the separator argument, I do get three triples whose objects are strings, as expected.

<https://example.org/Ross> <http://xmlns.com/foaf/0.1/knows> "http://example.org/Alice",
    "http://example.org/Bob", "http://example.org/Charlie" .

Display processing status in command-line output

Add an optional status indicator to the CLI output showing how far through the input file(s) the processing is. This could periodically write a message of the form:

processing row 10000 of input.csv

Define a 'minimal plus' mode

The 'minimal' mode defined in the CSVW specification does not output notes or annotations for tables or table groups defined within the metadata. These are output in standard mode, but standard mode also outputs statements describing the structure of the tabular data. Define an additional mode that works like minimal mode but additionally outputs notes and annotations for tables and table groups in the metadata when they specify an @id.

Infer RDF format in csv->rdf->file

csvw/csv->rdf->file currently always writes Turtle to the output file. Update it to infer the RDF format from the file extension, falling back to Turtle if the format cannot be inferred.
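A sketch of the extension lookup (the format keywords here are illustrative, not the library's actual format representation):

```clojure
;; Sketch: infer an RDF format from the output filename's extension,
;; defaulting to Turtle when the extension is unknown.
(require '[clojure.string :as str])

(defn infer-format [^String filename]
  (let [ext (-> filename (str/split #"\.") last str/lower-case)]
    (get {"ttl" :turtle, "nt" :ntriples, "nq" :nquads, "trig" :trig}
         ext
         :turtle)))
```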

CSVW datatype tests are very slow

csv2rdf validates all parsed cells against the XML type hierarchy. This operation is currently very expensive because it walks the type tree for every cell, every time.

Memoizing the call to is-subtype? should result in an approximate 50% speed increase of overall transformation runtime.
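The proposed fix could be as simple as the following sketch, where is-subtype?* stands in for the existing unmemoized implementation:

```clojure
;; Sketch: cache subtype checks so the XML type tree is walked at most
;; once per (type, supertype) pair rather than once per cell.
(def is-subtype? (memoize is-subtype?*))
```

Since the type hierarchy is finite, the memoization cache is bounded, so the usual concern about memoize growing without limit does not apply here.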

NoSuchFieldError bug when importing table2qb in muttnik deps

Added table2qb to muttnik deps; calling it from a simple Clojure function results in:

Syntax error (NoSuchFieldError) compiling at (mut/cube_builder.clj:64:1).
JAVA_ISO_CONTROL

This seems to be a problem with csv2rdf's URI template usage?

Syntax error (NoSuchFieldError) compiling at (mut/cube_builder.clj:64:1).
JAVA_ISO_CONTROL
user=> *e
#error {
 :cause "JAVA_ISO_CONTROL"
 :via
 [{:type clojure.lang.Compiler$CompilerException
   :message "Syntax error compiling at (mut/cube_builder.clj:64:1)."
   :data #:clojure.error{:phase :compile-syntax-check, :line 64, :column 1, :source "mut/cube_builder.clj"}
   :at [clojure.lang.Compiler load "Compiler.java" 7648]}
  {:type java.lang.NoSuchFieldError
   :message "JAVA_ISO_CONTROL"
   :at [com.github.fge.uritemplate.parse.CharMatchers <clinit> "CharMatchers.java" 40]}]
 :trace
 [[com.github.fge.uritemplate.parse.CharMatchers <clinit> "CharMatchers.java" 40]
  [com.github.fge.uritemplate.parse.VariableSpecParser <clinit> "VariableSpecParser.java" 44]
  [sun.reflect.NativeMethodAccessorImpl invoke0 "NativeMethodAccessorImpl.java" -2]
  [sun.reflect.NativeMethodAccessorImpl invoke "NativeMethodAccessorImpl.java" 62]
  [sun.reflect.DelegatingMethodAccessorImpl invoke "DelegatingMethodAccessorImpl.java" 43]
  [java.lang.reflect.Method invoke "Method.java" 498]
  [csv2rdf.util$invoke_method invokeStatic "util.clj" 75]
  [csv2rdf.util$invoke_method invoke "util.clj" 69]
  [csv2rdf.util$invoke_method invokeStatic "util.clj" 72]
  [csv2rdf.util$invoke_method invoke "util.clj" 69]
  [csv2rdf.metadata.column$parse_uri_template_variable invokeStatic "column.clj" 26]
  [csv2rdf.metadata.column$parse_uri_template_variable invoke "column.clj" 25]
  [csv2rdf.metadata.column$uri_template_variable invokeStatic "column.clj" 30]
  [csv2rdf.metadata.column$uri_template_variable invoke "column.clj" 28]
  [csv2rdf.metadata.column$validate_column_name invokeStatic "column.clj" 41]
  [csv2rdf.metadata.column$validate_column_name invoke "column.clj" 38]
  [csv2rdf.metadata.validator$chain$fn__8201$fn__8202 invoke "validator.clj" 99]
  [csv2rdf.metadata.validator$optional_key$fn__8228 invoke "validator.clj" 166]
  [csv2rdf.metadata.validator$kvps$fn__8233$fn__8234 invoke "validator.clj" 184]
  [clojure.core$map$fn__5866 invoke "core.clj" 2755]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.core$seq__5402 invokeStatic "core.clj" 137]
  [clojure.core$filter$fn__5893 invoke "core.clj" 2809]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.Cons next "Cons.java" 39]
  [clojure.lang.RT next "RT.java" 713]
  [clojure.core$next__5386 invokeStatic "core.clj" 64]
  [clojure.core.protocols$fn__8159 invokeStatic "protocols.clj" 169]
  [clojure.core.protocols$fn__8159 invoke "protocols.clj" 124]
  [clojure.core.protocols$fn__8114$G__8109__8123 invoke "protocols.clj" 19]
  [clojure.core.protocols$seq_reduce invokeStatic "protocols.clj" 31]
  [clojure.core.protocols$fn__8146 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__8146 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__8088$G__8083__8101 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6828]
  [clojure.core$into invokeStatic "core.clj" 6895]
  [clojure.core$into invoke "core.clj" 6887]
  [csv2rdf.metadata.validator$kvps$fn__8233 invoke "validator.clj" 186]
  [csv2rdf.metadata.types$validate_object_of$fn__8442 invoke "types.clj" 293]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 120]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 117]
  [csv2rdf.metadata.validator$array_of$fn__8211$fn__8212 invoke "validator.clj" 130]
  [clojure.core$map_indexed$mapi__8548$fn__8549 invoke "core.clj" 7308]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.core$seq__5402 invokeStatic "core.clj" 137]
  [clojure.core$filter$fn__5893 invoke "core.clj" 2809]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.lang.LazilyPersistentVector create "LazilyPersistentVector.java" 44]
  [clojure.core$vec invokeStatic "core.clj" 377]
  [clojure.core$vec invoke "core.clj" 367]
  [csv2rdf.metadata.validator$array_of$fn__8211 invoke "validator.clj" 129]
  [csv2rdf.metadata.column$columns invokeStatic "column.clj" 98]
  [csv2rdf.metadata.column$columns invoke "column.clj" 97]
  [csv2rdf.metadata.validator$optional_key$fn__8228 invoke "validator.clj" 166]
  [csv2rdf.metadata.validator$kvps$fn__8233$fn__8234 invoke "validator.clj" 184]
  [clojure.core$map$fn__5866 invoke "core.clj" 2755]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.core$seq__5402 invokeStatic "core.clj" 137]
  [clojure.core$filter$fn__5893 invoke "core.clj" 2809]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.Cons next "Cons.java" 39]
  [clojure.lang.RT next "RT.java" 713]
  [clojure.core$next__5386 invokeStatic "core.clj" 64]
  [clojure.core.protocols$fn__8159 invokeStatic "protocols.clj" 169]
  [clojure.core.protocols$fn__8159 invoke "protocols.clj" 124]
  [clojure.core.protocols$fn__8114$G__8109__8123 invoke "protocols.clj" 19]
  [clojure.core.protocols$seq_reduce invokeStatic "protocols.clj" 31]
  [clojure.core.protocols$fn__8146 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__8146 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__8088$G__8083__8101 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6828]
  [clojure.core$into invokeStatic "core.clj" 6895]
  [clojure.core$into invoke "core.clj" 6887]
  [csv2rdf.metadata.validator$kvps$fn__8233 invoke "validator.clj" 186]
  [csv2rdf.metadata.types$validate_object_of$fn__8442 invoke "types.clj" 293]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 120]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 117]
  [csv2rdf.metadata.schema$schema invokeStatic "schema.clj" 98]
  [csv2rdf.metadata.schema$schema invoke "schema.clj" 97]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 120]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 117]
  [csv2rdf.metadata.validator$optional_key$fn__8228 invoke "validator.clj" 166]
  [csv2rdf.metadata.validator$kvps$fn__8233$fn__8234 invoke "validator.clj" 184]
  [clojure.core$map$fn__5866 invoke "core.clj" 2755]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.core$seq__5402 invokeStatic "core.clj" 137]
  [clojure.core$filter$fn__5893 invoke "core.clj" 2809]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.Cons next "Cons.java" 39]
  [clojure.lang.RT next "RT.java" 713]
  [clojure.core$next__5386 invokeStatic "core.clj" 64]
  [clojure.core.protocols$fn__8159 invokeStatic "protocols.clj" 169]
  [clojure.core.protocols$fn__8159 invoke "protocols.clj" 124]
  [clojure.core.protocols$fn__8114$G__8109__8123 invoke "protocols.clj" 19]
  [clojure.core.protocols$seq_reduce invokeStatic "protocols.clj" 31]
  [clojure.core.protocols$fn__8146 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__8146 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__8088$G__8083__8101 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6828]
  [clojure.core$into invokeStatic "core.clj" 6895]
  [clojure.core$into invoke "core.clj" 6887]
  [csv2rdf.metadata.validator$kvps$fn__8233 invoke "validator.clj" 186]
  [csv2rdf.metadata.types$validate_object_of$fn__8442 invoke "types.clj" 293]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 120]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 117]
  [csv2rdf.metadata.types$validate_contextual_object$fn__8469 invoke "types.clj" 373]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 120]
  [csv2rdf.metadata.validator$variant$v__8207 invoke "validator.clj" 117]
  [csv2rdf.metadata.table$parse_table_json invokeStatic "table.clj" 38]
  [csv2rdf.metadata.table$parse_table_json invoke "table.clj" 37]
  [csv2rdf.metadata$parse_metadata_json invokeStatic "metadata.clj" 17]
  [csv2rdf.metadata$parse_metadata_json invoke "metadata.clj" 10]
  [csv2rdf.metadata$parse_table_group_from_source invokeStatic "metadata.clj" 23]
  [csv2rdf.metadata$parse_table_group_from_source invoke "metadata.clj" 21]
  [csv2rdf.tabular.processing$get_metadata invokeStatic "processing.clj" 26]
  [csv2rdf.tabular.processing$get_metadata invoke "processing.clj" 20]
  [csv2rdf.csvw$csv__GT_rdf invokeStatic "csvw.clj" 28]
  [csv2rdf.csvw$csv__GT_rdf invoke "csvw.clj" 19]
  [table2qb.pipelines.codelist$codelist__GT_csvw__GT_rdf invokeStatic "codelist.clj" 106]
  [table2qb.pipelines.codelist$codelist__GT_csvw__GT_rdf invoke "codelist.clj" 101]
  [table2qb.pipelines.codelist$codelist_pipeline invokeStatic "codelist.clj" 113]
  [table2qb.pipelines.codelist$codelist_pipeline invoke "codelist.clj" 108]
  [clojure.lang.AFn applyToHelper "AFn.java" 165]
  [clojure.lang.AFn applyTo "AFn.java" 144]
  [clojure.lang.Var applyTo "Var.java" 705]
  [clojure.core$apply invokeStatic "core.clj" 665]
  [clojure.core$apply invoke "core.clj" 660]
  [table2qb.cli.tasks$exec_pipeline$fn__7531 invoke "tasks.clj" 192]
  [table2qb.cli.tasks$exec_pipeline invokeStatic "tasks.clj" 187]
  [table2qb.cli.tasks$exec_pipeline invoke "tasks.clj" 184]
  [table2qb.cli.tasks$eval7540$fn__7542 invoke "tasks.clj" 217]
  [clojure.lang.MultiFn invoke "MultiFn.java" 239]
  [table2qb.main$inner_main invokeStatic "main.clj" 26]
  [table2qb.main$inner_main invoke "main.clj" 18]
  [mut.cube_builder$generate_all_cubes invokeStatic "cube_builder.clj" 58]
  [mut.cube_builder$generate_all_cubes invoke "cube_builder.clj" 49]
  [mut.cube_builder$eval7588 invokeStatic "cube_builder.clj" 64]
  [mut.cube_builder$eval7588 invoke "cube_builder.clj" 64]
  [clojure.lang.Compiler eval "Compiler.java" 7177]
  [clojure.lang.Compiler load "Compiler.java" 7636]
  [clojure.lang.RT loadResourceScript "RT.java" 381]
  [clojure.lang.RT loadResourceScript "RT.java" 372]
  [clojure.lang.RT load "RT.java" 459]
  [clojure.lang.RT load "RT.java" 424]
  [clojure.core$load$fn__6839 invoke "core.clj" 6126]
  [clojure.core$load invokeStatic "core.clj" 6125]
  [clojure.core$load doInvoke "core.clj" 6109]
  [clojure.lang.RestFn invoke "RestFn.java" 408]
  [clojure.core$load_one invokeStatic "core.clj" 5908]
  [clojure.core$load_one invoke "core.clj" 5903]
  [clojure.core$load_lib$fn__6780 invoke "core.clj" 5948]
  [clojure.core$load_lib invokeStatic "core.clj" 5947]
  [clojure.core$load_lib doInvoke "core.clj" 5928]
  [clojure.lang.RestFn applyTo "RestFn.java" 142]
  [clojure.core$apply invokeStatic "core.clj" 667]
  [clojure.core$load_libs invokeStatic "core.clj" 5985]
  [clojure.core$load_libs doInvoke "core.clj" 5969]
  [clojure.lang.RestFn applyTo "RestFn.java" 137]
  [clojure.core$apply invokeStatic "core.clj" 667]
  [clojure.core$require invokeStatic "core.clj" 6007]
  [clojure.core$require doInvoke "core.clj" 6007]
  [clojure.lang.RestFn invoke "RestFn.java" 408]
  [user$eval148 invokeStatic "NO_SOURCE_FILE" 1]
  [user$eval148 invoke "NO_SOURCE_FILE" 1]
  [clojure.lang.Compiler eval "Compiler.java" 7177]
  [clojure.lang.Compiler eval "Compiler.java" 7132]
  [clojure.core$eval invokeStatic "core.clj" 3214]
  [clojure.core$eval invoke "core.clj" 3210]
  [clojure.main$repl$read_eval_print__9086$fn__9089 invoke "main.clj" 437]
  [clojure.main$repl$read_eval_print__9086 invoke "main.clj" 437]
  [clojure.main$repl$fn__9095 invoke "main.clj" 458]
  [clojure.main$repl invokeStatic "main.clj" 458]
  [clojure.main$repl_opt invokeStatic "main.clj" 522]
  [clojure.main$main invokeStatic "main.clj" 667]
  [clojure.main$main doInvoke "main.clj" 616]
  [clojure.lang.RestFn invoke "RestFn.java" 397]
  [clojure.lang.AFn applyToHelper "AFn.java" 152]
  [clojure.lang.RestFn applyTo "RestFn.java" 132]
  [clojure.lang.Var applyTo "Var.java" 705]
  [clojure.main main "main.java" 40]]}

Log cell errors as warnings

Errors encountered when parsing cell values are currently associated with the cell but are not logged as a warning.

csv2rdf is tolerant of invalid JSON

I was surprised to discover that csv2rdf is tolerant of broken JSON in its metadata files, in particular missing commas etc.

Without wanting to open up the tolerant-reader debate, I think it would be better to require that JSON always be valid, and raise an error when it's not.

Personally I find JSON a bit awkward to write (it's very easy to accidentally leave superfluous comma or colon delimiters in, or accidentally omit them), and I worry that others crafting JSON metadata without extra tooling support will create JSON that is invalid and that only we can parse.

I think it would be better to require strict parsing to ensure that people are conservative in what they create.

It looks like this is an unexpected behaviour we have inherited from clojure.data.json/read-str:

;; with missing comma (invalid json) parses fine:
user> (clojure.data.json/read-str "{\"foo\":\"bar\" \"baz\":\"blah\"}") ;;=> {"foo" "bar", "baz" "blah"}
;; valid json with comma
user> (clojure.data.json/read-str "{\"foo\":\"bar\", \"baz\":\"blah\"}") ;;=> {"foo" "bar", "baz" "blah"}

I don't think the data.json parser supports a strict parsing flag, though, so we might not be able to support this without changing the JSON parser. In that case we may wish to drop support for tolerant reading of JSON, or pick up a dependency on another parser that supports a strict mode. I'd very much favour not having the extra dependency, but it may depend on whether we expect to handle invalid JSON.

Incidentally, I checked the JSON spec, and strictly speaking it's valid for JSON processors to accept broken syntax; they're just required to also handle valid JSON.

Add extension allowing literal values to be specified

The valueUrl property within column definitions allows URL values to be created, but other data types are not supported. Consider adding a column property for literal values, e.g.

{"aboutUrl": "http://subject",
 "propertyUrl": "http://property",
 "value": "true",
 "datatype": "xsd:boolean",
 "virtual": true}

Fail fast if the output directory can't be found/accessed

The Execution error (FileNotFoundException) at java.io.FileOutputStream/open0 (FileOutputStream.java:-2) error only appears after the conversion has been done (which can take a long time). It would be useful to fail faster when the target directory for the output RDF is missing or inaccessible.
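An up-front check could be sketched as follows (check-output-writable! is a hypothetical helper, not an existing function in the library):

```clojure
;; Sketch: validate the output location before starting the (potentially
;; long-running) conversion, instead of failing at write time.
(defn check-output-writable! [^java.io.File out-file]
  (let [dir (or (.getParentFile out-file) (java.io.File. "."))]
    (when-not (.isDirectory dir)
      (throw (ex-info (str "Output directory does not exist: " dir)
                      {:dir (str dir)})))))
```

Calling this from the CLI entry point before any parsing begins would surface the error immediately.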

How to download the csv2rdf-standalone.jar ?

Hi, I have looked in the README.md and the docs in the doc/ folder, but I could not find how to download or build the executable csv2rdf-standalone.jar.

The ideal would be to be able to download csv2rdf-standalone.jar directly from the GitHub releases (if you don't already have another distribution system). You can easily attach any file, such as a jar, to a new (or existing) release.

This would allow users to easily download the latest stable release of the executable jar and run it directly with Java (without needing Clojure knowledge or dependencies).

Thanks for this library, it looks really interesting!

Include line number in CSV parse errors

Tabular file parse error messages include the invalid character index e.g.

Opening quote must be at start of cell value at index n

Include the source line number so the invalid data can be located in the source file.

Include log4j configuration in uberjar

Running the uberjar results in a log4j configuration warning being displayed since there is no log config on the classpath. Include a logging config in the uberjar (but not the main library which should not have a logging backend dependency).

Resolve JSON URIs with http/get-uri

A large number of tests in the test/csvw_test namespace are failing because the metadata JSON URIs mocked in the tests are being resolved directly instead of via the mock HTTP client. Ensure JSON sources are resolved using http/get-uri which uses the currently-bound HTTPClient.

Case sensitivity in CSV boolean parsing

I have some data with the values True and False, but the parser does not accept these as valid boolean values. It would be good if the parser were case-insensitive here, to save me a pre-conversion step.
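A sketch of case-insensitive matching (parse-boolean-cell is an illustrative name; the lexical forms shown are CSVW's default true/false and 1/0):

```clojure
;; Sketch: lower-case the cell value before matching boolean lexical
;; forms, so "True" and "FALSE" parse alongside "true"/"false".
(require '[clojure.string :as str])

(defn parse-boolean-cell [^String s]
  (case (str/lower-case s)
    ("true" "1")  true
    ("false" "0") false
    ::invalid))
```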

URI encode whitespace

If you pass a value like "foo bar" into a URI template, it ought to expand into "foo%20bar" - i.e. URI-encoding the whitespace (see e.g. section 1.2 of the URI-template RFC).

At the moment this (reproducible example: whitespace.zip) leads to the following exception:

ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
13:55:59.619 [main] ERROR csv2rdf.main - #error {
 :cause Illegal character in path at index 22: http://example.net/foo bar
 :via
 [{:type java.lang.IllegalArgumentException
   :message Illegal character in path at index 22: http://example.net/foo bar
   :at [java.net.URI create URI.java 852]}
  {:type java.net.URISyntaxException
   :message Illegal character in path at index 22: http://example.net/foo bar
   :at [java.net.URI$Parser fail URI.java 2848]}]
 :trace
 [[java.net.URI$Parser fail URI.java 2848]
  [java.net.URI$Parser checkChars URI.java 3021]
  [java.net.URI$Parser parseHierarchical URI.java 3105]
  [java.net.URI$Parser parse URI.java 3053]
  [java.net.URI <init> URI.java 588]
  [java.net.URI create URI.java 850]
  [java.net.URI resolve URI.java 1036]
  [csv2rdf.util$resolve_uri invokeStatic util.clj 115]
  [csv2rdf.util$resolve_uri invoke util.clj 105]
  [csv2rdf.metadata.uri_template_property$resolve_uri_template_property invokeStatic uri_template_property.clj 23]
  [csv2rdf.metadata.uri_template_property$resolve_uri_template_property invoke uri_template_property.clj 17]
  [csv2rdf.tabular.csv$get_cell_urls invokeStatic csv.clj 138]
  [csv2rdf.tabular.csv$get_cell_urls invoke csv.clj 137]
  [csv2rdf.tabular.csv$annotate_row$fn__5873 invoke csv.clj 149]
  [clojure.core$map$fn__5587 invoke core.clj 2747]
  [clojure.lang.LazySeq sval LazySeq.java 40]
  [clojure.lang.LazySeq seq LazySeq.java 49]
  [clojure.lang.RT seq RT.java 528]
  [clojure.lang.LazilyPersistentVector create LazilyPersistentVector.java 44]
  [clojure.core$vec invokeStatic core.clj 377]
  [clojure.core$vec invoke core.clj 367]
  [csv2rdf.tabular.csv$annotate_row invokeStatic csv.clj 154]
  [csv2rdf.tabular.csv$annotate_row invoke csv.clj 143]
  [csv2rdf.tabular.csv$annotate_rows$fn__5888 invoke csv.clj 182]
  [clojure.core$map$fn__5587 invoke core.clj 2745]
  [clojure.lang.LazySeq sval LazySeq.java 40]
  [clojure.lang.LazySeq seq LazySeq.java 49]
  [clojure.lang.RT seq RT.java 528]
  [clojure.core$seq__5124 invokeStatic core.clj 137]
  [clojure.core$seq__5124 invoke core.clj 137]
  [csv2rdf.util$liberal_mapcat invokeStatic util.clj 287]
  [csv2rdf.util$liberal_mapcat invoke util.clj 284]
  [csv2rdf.csvw.minimal$fn__6050 invokeStatic minimal.clj 23]
  [csv2rdf.csvw.minimal$fn__6050 invoke minimal.clj 22]
  [clojure.lang.MultiFn invoke MultiFn.java 238]
  [csv2rdf.csvw$get_table_statements invokeStatic csvw.clj 17]
  [csv2rdf.csvw$get_table_statements invoke csvw.clj 14]
  [csv2rdf.csvw$csv__GT_rdf$fn__6771 invoke csvw.clj 33]
  [csv2rdf.util$liberal_mapcat invokeStatic util.clj 288]
  [csv2rdf.util$liberal_mapcat invoke util.clj 284]
  [csv2rdf.csvw$csv__GT_rdf invokeStatic csvw.clj 32]
  [csv2rdf.csvw$csv__GT_rdf invoke csvw.clj 19]
  [csv2rdf.csvw$csv__GT_rdf__GT_destination invokeStatic csvw.clj 42]
  [csv2rdf.csvw$csv__GT_rdf__GT_destination invoke csvw.clj 37]
  [csv2rdf.main$write_output invokeStatic main.clj 52]
  [csv2rdf.main$write_output invoke main.clj 49]
  [csv2rdf.main$_main invokeStatic main.clj 70]
  [csv2rdf.main$_main doInvoke main.clj 56]
  [clojure.lang.RestFn applyTo RestFn.java 137]
  [csv2rdf.main main nil -1]]}

When we would instead expect (minimal mode) output like:

<http://example.net/foo%20bar> <http://www.w3.org/2000/01/rdf-schema#label> "foo bar" .

IOException: Stream closed when run via clojure cli tool

e.g.

clj -A:with-logging -m csv2rdf.main -t w3c-csvw/tests/test001.csv 

...
_:rowG__8522 <http://www.w3.org/ns/csvw#rownum> 8;
  <http://www.w3.org/ns/csvw#url> <file:/Users/rick/repos/csv2rdf/w3c-csvw/tests/test001.csv#row=9> .

_:tableG__8507 <http://www.w3.org/ns/csvw#url> <file:/Users/rick/repos/csv2rdf/w3c-csvw/tests/test001.csv> .
Exception in thread "main" java.io.IOException: Stream closed
	at java.base/sun.nio.cs.StreamEncoder.ensureOpen(StreamEncoder.java:45)
	at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
	at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:254)
	at clojure.core$flush.invokeStatic(core.clj:3703)
	at clojure.main$main.invokeStatic(main.clj:424)
	at clojure.main$main.doInvoke(main.clj:387)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.lang.Var.applyTo(Var.java:702)
	at clojure.main.main(main.java:37)

I'm assuming this is because we close *out*, but clojure.main also wants to close it when the program is run this way.

If we're writing to *out* we should assume something else is responsible for closing it, and only flush.
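A minimal sketch of that idea (the `write-output` name and shape here are hypothetical, not the actual csv2rdf code): only close writers the program opened itself, and merely flush when the destination is *out*.

```clojure
;; Hypothetical sketch, not the real csv2rdf write-output: when the
;; writer is *out*, just flush and leave closing to the caller
;; (e.g. clojure.main); otherwise close it as usual.
(defn write-output [^java.io.Writer writer statements]
  (try
    (doseq [s statements]
      (.write writer (str s \newline)))
    (finally
      (if (identical? writer *out*)
        (.flush writer)
        (.close writer)))))
```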

Failure when fetching external table schema

When we have external tables and schemas declared, e.g.:

  "tables": [
    {
      "url": "https://gss-cogs.github.io/family-trade/reference/codelists/flow-directions.csv",
      "tableSchema": "https://gss-cogs.github.io/ref_common/codelist-schema.json",
      "suppressOutput": true
    },
    {
      "url": "https://gss-cogs.github.io/family-trade/reference/codelists/sitc4.csv",
      "tableSchema": "https://gss-cogs.github.io/family-trade/reference/schemas/sitc4-schema.json",
      "suppressOutput": true
    },

the tableSchema is fetched over HTTP and read-json is then attempted on the resulting body string, which fails. I think it should be json/read-str instead, i.e.:

--- a/src/csv2rdf/source.clj
+++ b/src/csv2rdf/source.clj
@@ -33,7 +33,7 @@
   URI
   (get-json [uri]
     (let [{:keys [body]} (http/get-uri uri)]
-      (read-json body)))
+      (json/read-str body)))

   File
   (get-json [f] (read-json f))
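For context, a quick illustration of the difference (assuming clojure.data.json is the json alias used in source.clj): json/read-str parses a JSON string directly, whereas a read-json that hands its argument to io/reader would interpret the string as a path or URL.

```clojure
(require '[clojure.data.json :as json])

;; Parses the string itself:
(json/read-str "{\"columns\": []}")   ;; => {"columns" []}

;; By contrast, (clojure.java.io/reader "{\"columns\": []}") would try
;; to open the string as a file/URL and throw, which is roughly what
;; happens when read-json is applied to an HTTP response body.
```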

Respect parent properties when merging with embedded metadata

processing/get-metadata loads the provided table/tableGroup JSON and merges it with the embedded metadata for each referenced tabular file. This happens before parent properties are merged into the constructed metadata, so they are not resolved correctly in the merge. For example:

{
  "dialect": { ... },
  "url": "data.csv",
  "tableSchema": { ... }
}

is resolved into a tableGroup containing a single table. The top-level tableGroup specifies no dialect, which results in the default dialect being used to extract the embedded metadata instead of the one specified. The table dialect should instead be resolved in the usual way, by searching up from the table to its parent tableGroup.

Similarly the following:

{
  "tableSchema": { ... },
  "tables": [
    {"url": "table1.csv"}
  ]
}

should result in the specified table using the schema defined by the parent tableGroup. This inheritance is not currently resolved, so the schema is not merged with the embedded schema.
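In other words, after inheritance the normalised table in the second example should look roughly like this (sketch only; the { ... } stands for the tableSchema body copied down from the parent tableGroup):

```json
{
  "tables": [
    {
      "url": "table1.csv",
      "tableSchema": { ... }
    }
  ]
}
```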
