Comments (16)
Hi,
I encountered the same problem when generating HDT files with hdt-cpp. Only a query that contains an explicit cast in a filter will give you results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .
Also, it is worth noting that if you generate your HDT files with hdt-java, you won't have this problem.
Best,
Nevena
from hdt-java.
I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.
Thanks for your hints, @mn120110d !
Only a query that contains an explicit cast in a filter will give you results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .
Oh, that's a useful workaround. It does increase query execution time, though, since the endpoint first has to extract all triples and then discard most of them in the filter, instead of looking up the relevant ones directly in the index.
Also, it is worth noting that if you generate your HDT files with hdt-java, you won't have this problem.
Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to HDT? To me there seem to be four ways that could happen:
- The spec is unambiguous; the C++ implementation gets it right and the Java implementation gets it wrong.
- The spec is unambiguous; the Java implementation gets it right and the C++ implementation gets it wrong.
- The spec is unambiguous and neither the Java nor the C++ implementation gets it right.
- The spec is ambiguous and both implementations get it right while still producing different results.
Given that the C++ implementation seems to be better maintained, my guess would be the first option.
I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.
Interesting thought. Can you expand a bit on that?
In RDF 1.0, the plain literal "foo" is different from "foo"^^xsd:string. In RDF 1.1 they are the same, since all plain literals (i.e. no language tag and no datatype) have an implicit datatype of xsd:string.
Now let's say the HDT file encodes such literals without a datatype, but the hdt-jena layer expects them to have the datatype xsd:string (or vice versa). They would have a different encoding and thus wouldn't match.
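The suspected mismatch can be sketched in plain Java, without any HDT or Jena dependency. The dictionary map and `lookup` helper below are hypothetical stand-ins for what an HDT dictionary conceptually does (map serialized terms to IDs); the point is only that exact string comparison can never match the two encodings of the same RDF 1.1 term.

```java
import java.util.HashMap;
import java.util.Map;

public class LiteralMismatch {
    // Hypothetical dictionary mapping serialized terms to integer IDs,
    // as an HDT dictionary conceptually does.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        // The converter stored the literal with its explicit datatype ...
        DICT.put("\"foo\"^^<http://www.w3.org/2001/XMLSchema#string>", 1);
    }

    static Integer lookup(String term) {
        return DICT.get(term);
    }

    public static void main(String[] args) {
        // ... but if the query layer strips xsd:string before the lookup,
        // the simple-literal form is not found.
        System.out.println(lookup("\"foo\"")); // prints null
    }
}
```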
OK, I see.
When I inspected the HDT file with HDT-it!, I found the string "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> (with datatype), but adding the datatype in the SPARQL query didn't help. So the only possibility would be that the hdt-jena implementation strips the datatype from the literal in the SPARQL query, but that further down the line it's expected.
You're welcome, @larsgsvensson .
I agree that the solution with filter increases execution time, but I haven't found a better way to do it.
Regarding your question:
Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to hdt?
Here is an issue that mentions both versions and their differences:
#58
I believe the answer to your question would be this part of the conversation:
regardless if a .hdt was created with C++ or Java, shouldn't it be the exact same format?
in theory yes, in practice not that easy without dedicated development teams :/
So my guess would be that no matter what tool you use, the generated HDT should be correct, but for now you'll have to adjust your queries to support different versions. I hope someone more competent can give a better and more detailed explanation. :)
I'm curious about two questions here:
- Is this really dependent on the input file size? @larsgsvensson mentioned that the problem was with a large input file, but not with a very small subset. If that is true, where is the limit? There's a rather large gap between 266M triples and 13 triples...
- If the file generated with hdt-java works but the hdt-cpp version doesn't, what's the difference? Is the datatype of plain literals encoded differently in hdt files generated by the two tools?
Trying to answer these might lead to more clues about where the problem is.
I think that the hdt-java version was originally created using a Jena version earlier than 3.0, which changed the semantics of plain literals to follow RDF 1.1. In fact it must have been, since the project started around 2012 and Jena 3.0 was released in July 2015. So maybe when the dependency was eventually updated to Jena 3.0+, some code paths were left that expected the old RDF 1.0-style plain literals. Or maybe the hdt-cpp tools assume RDF 1.0-style literals while hdt-java is all RDF 1.1.
I'm not sure this is related to file size or which library generates the file. It seems like an incompatibility in the Fuseki integration.
I was able to reproduce the issue with the bulk instrument file from permid.org (14 MB gzipped).
Using HDT-it! with the raw triples (which include typed string literals) and default settings, I get an HDT file for which Fuseki returns 0 rows for the queries:
select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }
and
select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR"^^<http://www.w3.org/2001/XMLSchema#string> . }
If I remove the types from the file using sed:
gzcat OpenPermID-bulk-instrument-20171106_072520.ntriples.gz | sed 's/\^\^<http:\/\/www.w3.org\/2001\/XMLSchema#string>//' > stripped.nt
the following query works in Fuseki, returning one row:
select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }
Importantly, I get the same results whether I use HDT-it! (which relies on the C++ library) or the Java CLI. I'm using the current build of HDT-it! from the website on a Mac and the current trunk from GitHub for the CLI.
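For reference, the sed substitution above can be mirrored in plain Java. `stripXsdString` is a hypothetical helper, not part of either library; like the sed command, it removes the explicit xsd:string datatype suffix so that the literal becomes a simple literal.

```java
public class StripXsdString {
    static final String XSD_STRING = "^^<http://www.w3.org/2001/XMLSchema#string>";

    // Remove an explicit xsd:string datatype suffix from an N-Triples line,
    // turning "foo"^^<...#string> into the simple literal "foo".
    // String.replace uses literal matching, so no regex escaping is needed.
    static String stripXsdString(String ntriplesLine) {
        return ntriplesLine.replace(XSD_STRING, "");
    }

    public static void main(String[] args) {
        String line = "<s> <p> \"Jardine Strategic Holdings IDR\"" + XSD_STRING + " .";
        System.out.println(stripXsdString(line));
        // prints: <s> <p> "Jardine Strategic Holdings IDR" .
    }
}
```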
After some time I managed to get a closer look at this (or rather at the Server.java implementation that uses part of hdt-jena).
Having set up a TPF server using Server.java, I tried to query an HDT file where all literals have datatypes; in particular, the default datatype is xsd:string. It didn't work. I dug a bit through the source code and found that access to the HDT dictionary is handled by org.rdfhdt.hdtjena.NodeDictionary. Here, the method #nodeToStr strips the datatype off the literal if the datatype is xsd:string, which explains why the querying doesn't work:
public static String nodeToStr(Node node) {
    if (node == null || node.isVariable()) {
        return "";
    } else if (node.isURI()) {
        return node.getURI();
    } else if (node.isLiteral()) {
        RDFDatatype t = node.getLiteralDatatype();
        if (t == null || XSDDatatype.XSDstring.getURI().equals(t.getURI())) {
            // String
            return "\"" + node.getLiteralLexicalForm() + "\"";
        } else if (RDFLangString.rdfLangString.equals(t)) {
            // Lang
            return "\"" + node.getLiteralLexicalForm() + "\"@" + node.getLiteralLanguage();
        } else {
            // Typed
            return "\"" + node.getLiteralLexicalForm() + "\"^^<" + t.getURI() + ">";
        }
    } else {
        return node.toString();
    }
}
I think this doesn't conform to RDF 1.1 §3.3:
Please note that concrete syntaxes MAY support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string.
The implementation in NodeDictionary turns the syntactic sugar into a norm.
@larsgsvensson It is indeed possible to store simple literals in HDT. This means that the same RDF 1.1 term can be represented by two distinct terms in an HDT.
@wouterbeek Yes, sure it is. The point I'm aiming at is that if I convert the RDF triple
:subject :predicate "object"^^xsd:string .
to HDT using the C++ implementation, I cannot query it using the Java implementation, since the C++ converter retains the datatype xsd:string on all literals, whereas the Java implementation strips the xsd:string from the literal before querying the NodeDictionary. That way the literal will never be found.
If there are two ways to store the literal, then I must be able to query them exactly as I stored them, or the search for "rdf-hdt" in the object position must deliver the same result as the search for "rdf-hdt"^^xsd:string (i.e. the implementation must consider
:subject :predicate "object"^^xsd:string .
and
:subject :predicate "object" .
equivalent in every respect). If the implementation lets me store a literal with an xsd:string datatype but doesn't let me query it that way, then I shouldn't be allowed to store it that way in the first place.
As I see it, the only way to accomplish this is to mandate that in HDT files the datatype xsd:string is either always added if not already present, or always removed if present. The implementations accessing the HDT file would then need to be adjusted accordingly.
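A normalization step of the kind described above could look like the sketch below. `canonicalize` is a hypothetical helper (not part of hdt-java or hdt-cpp) that picks the "always add xsd:string" convention: any plain literal gets the explicit datatype before it enters the dictionary, so both encodings collapse to one canonical form.

```java
public class CanonicalLiteral {
    static final String XSD_STRING_SUFFIX = "^^<http://www.w3.org/2001/XMLSchema#string>";

    // Canonicalize a serialized RDF term: every plain literal (no language
    // tag, no datatype) gets an explicit xsd:string datatype. IRIs,
    // lang-tagged literals, and already-typed literals pass through unchanged.
    // This is a sketch over simple serialized terms, ignoring escaping edge cases.
    static String canonicalize(String term) {
        boolean isLiteral = term.startsWith("\"");
        boolean isPlain = isLiteral && term.endsWith("\"");
        return isPlain ? term + XSD_STRING_SUFFIX : term;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("\"foo\""));   // gains ^^xsd:string
        System.out.println(canonicalize("\"foo\"@en")); // unchanged
    }
}
```

The mirror-image convention (always strip xsd:string) would work equally well; the only hard requirement is that writer and reader agree on one of the two.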
@larsgsvensson I agree with you that the best solution would be to always store "..."^^<http://www.w3.org/2001/XMLSchema#string> in HDT, and never store "...".
Unfortunately, "..." is a legal RDF term in Turtle, TriG, N-Triples, and N-Quads. This means that any compliant parser is allowed to emit "...", and that this should not be fixed in the parser (i.e. in Serd).
Instead, what is needed is HDT-specific code that transforms "..." into "..."^^<http://www.w3.org/2001/XMLSchema#string> upon HDT file creation. If somebody were able to implement this in a pull request, that would be very welcome.
(Notice that this would not invalidate existing HDTs that use "...". It would just guarantee that newly created HDTs are not ambiguous.)
@larsgsvensson I've created an issue for this in the proper place: rdfhdt/hdt-cpp#173
You can close this current issue if there are no other hdt-java-specific components to it.
Thanks @wouterbeek. In rdfhdt/hdt-cpp#173 I suggested doing it the other way round, since it seems that most implementations use the "..." form when reading.
I don't think there are any other hdt-java issues here, so I'll close.
Just for documentation purposes, this is my current workaround:
- Since hdt-java depends on hdt-jena, update org.rdfhdt.hdtjena.NodeDictionary so that the methods NodeDictionary#nodeToStr(Node) and NodeDictionary#nodeToStr(Node, PrefixMapping) are non-static. Update testcases accordingly.
- In HdtBasedRequestProcessorForTPFs, replace the constructor as follows:
public HdtBasedRequestProcessorForTPFs(final String hdtFile) throws IOException {
    this.datasource = HDTManager.mapIndexedHDT(hdtFile, null); // listener=null
    this.dictionary = new NodeDictionary(this.datasource.getDictionary()) {
        @Override
        public String nodeToStr(final Node node) {
            if (node == null || node.isVariable()) {
                return "";
            } else if (node.isURI()) {
                return node.getURI();
            } else if (node.isLiteral()) {
                final RDFDatatype t = node.getLiteralDatatype();
                if (t == null) {
                    // String
                    return "\"" + node.getLiteralLexicalForm() + "\"";
                } else if (RDFLangString.rdfLangString.equals(t)) {
                    // Lang
                    return "\"" + node.getLiteralLexicalForm() + "\"@"
                            + node.getLiteralLanguage();
                } else {
                    // Typed
                    return "\"" + node.getLiteralLexicalForm() + "\"^^<"
                            + t.getURI() + ">";
                }
            } else {
                return node.toString();
            }
        }
    };
}
I.e. I replace nodeToStr(Node) with an implementation that keeps the datatype even when it is xsd:string.