Comments (17)

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

@migalkin Thanks to @mielvds, the downloads are now available at http://downloads.linkeddatafragments.org/hdt/fedx/

from server.js.

migalkin avatar migalkin commented on August 12, 2024

Oh, my bad: the issue is apparently connected to Client.js, not Server.js. Please move it to the Client.js repo :)

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

@migalkin Thanks for reporting. In this case, the error is in the input file: the URI is invalid. For performance reasons, the server makes the assumption that the HDT file is built from a valid RDF input file.

When generating the HDT file, please use rdf2hdt -f turtle in.nt out.hdt. Even though your input is N-Triples, the Turtle parser (actually SERD) is doing a more precise job than the built-in parser.

My colleague @mielvds can probably tell you what we did to make the KEGG dataset valid.
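
As a pre-check before building the HDT file, invalid IRIs can be listed up front. The following is a minimal Python sketch, not part of the server or the HDT tools. It assumes one triple per line and flags IRIs containing characters that the N-Triples grammar excludes from IRIREF (U+0000 to U+0020 and <>"{}|^`\). The function name invalid_iris is made up for illustration.

```python
import re

# Characters the N-Triples grammar forbids inside <...>:
# U+0000-U+0020 and < > " { } | ^ ` \
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# \uXXXX and \UXXXXXXXX escapes are allowed in IRIs, so drop them before checking.
UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')
# Match a quoted literal (skipped) or an IRI in angle brackets (captured).
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^>]*)>')


def invalid_iris(lines):
    """Yield (line_number, iri) for every IRI with a forbidden character."""
    for number, line in enumerate(lines, start=1):
        for match in TOKEN.finditer(line):
            iri = match.group(1)  # None when the token is a literal
            if iri is not None and FORBIDDEN.search(UCHAR.sub('', iri)):
                yield number, iri
```

Iterating over `invalid_iris(open('in.nt', encoding='utf-8'))` and printing each pair gives a list of the lines that need attention before running rdf2hdt.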

migalkin avatar migalkin commented on August 12, 2024

@RubenVerborgh thanks for the insights. I tried the Turtle parser, but the result is still the same.
I actually found a lot of such URIs containing a space in KEGG/Bio2RDF, so fixing them manually is apparently not the best way.

mielvds avatar mielvds commented on August 12, 2024

Perhaps @hariharshankar can provide some insight? He has created a lot of HDT files from dirty RDF datasets

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

@migalkin Strange, I expect the SERD parser to fail on URIs with a space. If not, that's a bug in the SERD parser… do you have the latest version installed? Just for clarity: I'm talking about the C++ version of the HDT utility; I don't know about the Java utility.

Fixing manually is certainly not the best way. I have treated similar datafiles in the past with just a simple sed (if spaces are the only problem).

migalkin avatar migalkin commented on August 12, 2024

@RubenVerborgh Yes, I use the latest C++ version available on github. Though the original dataset is in N3, first I had to use rdf2rdf parser to transform it to NT and then apply the C++ HDT tool. None of them threw any error.

If you have a script to eliminate spaces in URIs, that would be great: my knowledge of regular expressions is not enough to target only URIs within <> without affecting string literals, for example.

thadguidry avatar thadguidry commented on August 12, 2024

@migalkin Perhaps you're dealing with a non-breaking space that's affecting the parser? I've had to deal with Unicode a lot, and sometimes you run into these situations. Here's a writeup I did for OpenRefine that describes the problem: https://github.com/OpenRefine/OpenRefine/wiki/Recipes#question-marks--showing-in-your-data
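
One way to check this is to list which whitespace code points actually occur inside the IRIs. Below is a minimal Python sketch, assuming one triple per line. The helper name whitespace_in_iris is hypothetical, and the simple <...> pattern will also look inside literals that happen to contain angle brackets.

```python
import re
import unicodedata
from collections import Counter

# Naive IRI matcher: everything between '<' and the next '>'.
IRI = re.compile(r'<([^>]*)>')


def whitespace_in_iris(lines):
    """Count whitespace characters found inside <...>, keyed by code point and name."""
    counts = Counter()
    for line in lines:
        for iri in IRI.findall(line):
            for ch in iri:
                if ch.isspace():
                    # Control characters such as TAB have no Unicode name, hence the default.
                    name = unicodedata.name(ch, 'UNNAMED')
                    counts[f'U+{ord(ch):04X} {name}'] += 1
    return counts
```

If the result contains only U+0020 SPACE, a plain space replacement should be enough; an entry such as U+00A0 NO-BREAK SPACE would point to the Unicode issue described above.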

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

Strange, serdi fails for me with invalid IRI character ` ' (escape %20).

This sed script should help you out:

sed 's/<\([^>]*\) \([^>]*\)>/<\1%20\2>/g' input.nt > input_fixed.nt

hariharshankar avatar hariharshankar commented on August 12, 2024

The scripts I use to check and clean the DBpedia dataset in N-Triples format are here: https://bitbucket.org/hariharshankar/dbpedia_hdt. We mainly tailored the scripts to deal with issues in the DBpedia dataset.

Specifically, this is the script that contains regexes that you were interested in:
https://bitbucket.org/hariharshankar/dbpedia_hdt/src/dc11f79ebecbd5f7de28c336dfcb76436eb17a66/dbp_clean.py?at=master&fileviewer=file-view-default

Hope that helps.

migalkin avatar migalkin commented on August 12, 2024

Thank you for your suggestions

@RubenVerborgh the script returns a different entity, e.g., for

<http://bio2rdf.org/pdb-ccd:NAD NAJ>

the result is

<D%20J>

I'm afraid it's an invalid URI again. Is there any way to retain the full string?

@hariharshankar
I executed the script and it (according to the docs) deleted all the 'dirty' triples.
I agree it might be a solution (apparently that is how LOD Laundromat works), but for us it is a problem: we are comparing federated query engines, and the triples deleted for LDF might still be returned by other RDF query engines (which use the original 'dirty' dataset). That affects the cardinality of the answer and therefore makes the evaluation unreliable.

@thadguidry
Thanks for pointing this out, I'll try to apply it.

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

Sorry, should have been:

sed 's/<\([^>]*\) \([^>]*\)>/<\1%20\2>/g' input.nt > input_fixed.nt

Also corrected it above.

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

Actually, is it FedX that you are running? Because several of its datasets have problems, and because of that, there is some uncertainty over what the correct results are. @mielvds has cleaned the datasets and converted them to HDT. We can send them to you if you like (and perhaps, we should just offer this on our website, as this is a common use case).

migalkin avatar migalkin commented on August 12, 2024

@RubenVerborgh
Thank you for the script, it helped me a lot.
In some cases, e.g.,

<http://bio2rdf.org/ec:Acting on the CH-OH group of donors;> .

it replaces only the last space like

<http://bio2rdf.org/ec:Acting on the CH-OH group of%20donors;> .

so I just wrote a short additional script to fix the remaining erroneous lines.
The HDT parser worked fine and I can query the LDF server without any errors/warnings.
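
The sed expression replaces only one space per IRI on each pass because the first [^>]* is greedy, so it captures everything up to the last space. A possible alternative, sketched here in Python under the assumption of one triple per line, percent-encodes every space inside <...> while skipping quoted literals. The names encode_spaces_in_iris and fix_file are hypothetical, and this is not the additional script mentioned above.

```python
import re

# Match either a quoted string literal or an IRI in angle brackets.
# Literals are matched first so that '<' or '>' inside them is never treated as an IRI.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<[^>]*>')


def encode_spaces_in_iris(line: str) -> str:
    """Replace every space inside <...> with %20; leave literals untouched."""
    def fix(match):
        token = match.group(0)
        if token.startswith('<'):
            return token.replace(' ', '%20')
        return token  # quoted literal: return unchanged
    return TOKEN.sub(fix, line)


def fix_file(src: str, dst: str) -> None:
    """Stream src to dst line by line, so large dumps need not fit in memory."""
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(encode_spaces_in_iris(line))
```

Calling `fix_file('input.nt', 'input_fixed.nt')` mirrors the sed invocation. Unlike deleting dirty triples, this keeps the number of triples unchanged, although the rewritten IRIs no longer match the original strings byte for byte.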

Indeed, we use FedBench as one of the benchmarks. Frankly, I can't imagine why they decided to publish such invalid data as a benchmark (whining, whining).
We would be very grateful if you could share the cleaned dumps (including the original N-Triples dumps, as we'll have to upload them to non-HDT stores).

RubenVerborgh avatar RubenVerborgh commented on August 12, 2024

@mielvds Do you remember where we've put those dumps? I'll put them on the website.

mielvds avatar mielvds commented on August 12, 2024

I'll get back to you on that.

mielvds avatar mielvds commented on August 12, 2024
