Comments (17)
@migalkin Thanks to @mielvds, the downloads are now available at http://downloads.linkeddatafragments.org/hdt/fedx/
from server.js.
Oh, my bad, the issue is apparently connected to Client.js, not Server.js. Please move it to the Client.js repo :)
@migalkin Thanks for reporting. In this case, the error is in the input file: the URI is invalid. For performance reasons, the server makes the assumption that the HDT file is built from a valid RDF input file.
When generating the HDT file, please use rdf2hdt -f turtle in.nt out.hdt. Even though your input is N-Triples, the Turtle parser (actually SERD) is doing a more precise job than the built-in parser.
My colleague @mielvds can probably tell you what we did to make the KEGG dataset valid.
@RubenVerborgh Thanks for the insights. I tried the Turtle parser, but the result is still the same.
I actually found a lot of such URIs containing a space in KEGG/Bio2RDF, so fixing them manually is apparently not the best way.
Perhaps @hariharshankar can provide some insight? He has created a lot of HDT files from dirty RDF datasets.
@migalkin Strange, I expect the SERD parser to fail on URIs with a space. If not, that's a bug in the SERD parser… Do you have the latest version installed? Just for clarity: I'm talking about the C++ version of the HDT utility; I don't know about the Java utility.
Fixing manually is certainly not the best way. I have treated similar datafiles in the past with just a simple sed (if spaces are the only problem).
@RubenVerborgh Yes, I use the latest C++ version available on GitHub. However, the original dataset is in N3, so I first had to use the rdf2rdf parser to transform it to N-Triples and then apply the C++ HDT tool. Neither of them threw any error.
If you have a script to eliminate spaces in URIs, that would be great, as my knowledge of regular expressions is not enough to target only URIs within <> without affecting string literals, for example.
@migalkin Perhaps you're dealing with a non-breaking space that's affecting the parser? I've had to deal with Unicode a lot, and sometimes you run into these situations. Here's a writeup I did for OpenRefine that describes the problem: https://github.com/OpenRefine/OpenRefine/wiki/Recipes#question-marks--showing-in-your-data
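One way to check for this before converting is to scan the IRIs for any whitespace code point, not only the ordinary space. The Python sketch below is an illustration only: the function name `whitespace_in_iris` is made up here, and it assumes one N-Triples statement per line, with IRIs in angle brackets and literals in double quotes.

```python
import re

# Matches either a double-quoted literal (which is skipped) or an IRI in
# angle brackets (whose content is captured in group 1).
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^>]*)>')


def whitespace_in_iris(line):
    """Return (iri, code point) pairs for each whitespace character inside an IRI."""
    hits = []
    for match in TOKEN.finditer(line):
        iri = match.group(1)
        if iri is None:  # the token was a string literal, not an IRI
            continue
        for ch in iri:
            # str.isspace() is also true for U+00A0 (non-breaking space)
            if ch.isspace():
                hits.append((iri, f"U+{ord(ch):04X}"))
    return hits
```

A hit reported as U+0020 is a plain space; anything else (for example U+00A0) would not be caught by a sed pattern that only looks for a literal space.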
Strange, serdi fails for me with invalid IRI character ` ' (escape %20).
This sed script should help you out:
sed 's/<\([^>]*\) \([^>]*\)>/<\1%20\2>/g' input.nt > input_fixed.nt
The scripts I use to check and clean the DBpedia dataset in N-Triples format are here: https://bitbucket.org/hariharshankar/dbpedia_hdt. We mainly tailored the scripts to deal with issues in the DBpedia dataset.
Specifically, this is the script that contains regexes that you were interested in:
https://bitbucket.org/hariharshankar/dbpedia_hdt/src/dc11f79ebecbd5f7de28c336dfcb76436eb17a66/dbp_clean.py?at=master&fileviewer=file-view-default
Hope that helps.
Thank you for your suggestions
@RubenVerborgh the script returns a different entity, e.g., for
<http://bio2rdf.org/pdb-ccd:NAD NAJ>
the result is
<D%20J>
I'm afraid it's an invalid URI again. Is there any way to retain the full string?
@hariharshankar
I executed the script and, according to the docs, it deleted all the 'dirty' triples.
I agree it might be a solution (apparently that is how LOD Laundromat works), but for us it is a problem because we are comparing federated query engines. All the triples deleted for LDF might still be returned by other RDF query engines (which use the original 'dirty' dataset), so the cardinality of the answers differs and the evaluation becomes unreliable.
@thadguidry
Thanks for pointing this out, I'll try to apply it.
Sorry, should have been:
sed 's/<\([^>]*\) \([^>]*\)>/<\1%20\2>/g' input.nt > input_fixed.nt
Also corrected it above.
Actually, is it FedX that you are running? Because several of its datasets have problems, and because of that, there is some uncertainty over what the correct results are. @mielvds has cleaned the datasets and converted them to HDT. We can send them to you if you like (and perhaps, we should just offer this on our website, as this is a common use case).
@RubenVerborgh
Thank you for the script, it helped me a lot.
In some cases, e.g.,
<http://bio2rdf.org/ec:Acting on the CH-OH group of donors;> .
it replaces only the last space, like
<http://bio2rdf.org/ec:Acting on the CH-OH group of%20donors;> .
so I just wrote a short additional script to fix the remaining erroneous lines.
The HDT parser worked fine, and I can query the LDF server without any errors or warnings.
Indeed, we use FedBench as one of the benchmarks. Frankly, I can't imagine why they decided to publish such invalid data as a benchmark (whining, whining).
We would be very grateful if you could share the cleaned dumps (including the original N-Triples dumps, as we'll have to upload them to non-HDT stores).
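The last-space-only behaviour described above follows from the greedy `[^>]*` in the sed pattern: each substitution escapes only the final space of an IRI. As a rough alternative, here is a minimal Python sketch that escapes every space in one pass. It is not the script used in this thread; the function name `escape_spaces_in_iris` is hypothetical, and it assumes one N-Triples statement per line, that a plain space (U+0020) is the only invalid character, and that `%20` is the desired replacement, as in the sed command.

```python
import re

# Matches either a double-quoted literal or an IRI in angle brackets.
# Literals are matched first so that '<' or '>' inside them is never
# mistaken for an IRI delimiter.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<[^>]*>')


def escape_spaces_in_iris(line):
    """Replace every space inside <...> IRIs with %20, leaving literals untouched."""
    def fix(match):
        token = match.group(0)
        if token.startswith("<"):
            return token.replace(" ", "%20")
        return token  # string literal: return unchanged
    return TOKEN.sub(fix, line)
```

Applied line by line to input.nt, this would write the same kind of input_fixed.nt as the sed command, but without needing a second pass for IRIs that contain several spaces.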
@mielvds Do you remember where we've put those dumps? I'll put them on the website.
I'll get back to you on that.