Comments (12)

simongray commented on July 23, 2024

You might have luck converting the DanNet RDF dataset to WN-LMF using this tool by John McCrae: https://github.com/jmccrae/gwn-scala-api

DanNet is RDF in the Turtle serialisation.

from dannet.

simongray commented on July 23, 2024

I just tried to run it myself and unfortunately didn't have much luck: jmccrae/gwn-scala-api#23

For the record, I tried to run:

./gwn -i dannet.ttl -o dannet-lmf.xml -f RDF -t WNLMF --input-rdf-lang TURTLE

hallundbaek commented on July 23, 2024

I gave that a go, and by renaming parts of the Lexicon to match this example, I got it to at least convert without errors.

But at that point the output contained only the Lexicon and none of the data within it.

I also tried running the conversion on the English WordNet TTL, which did not convert either.

Since it is the same author, this led me to conclude that gwn was probably deprecated, since dogfooding the English WordNet wasn't supported.

Which led me to open this issue, hoping that it would be a simple artifact for you to produce.

simongray commented on July 23, 2024

I started trying to implement a WN-LMF export this morning: https://github.com/kuhumcst/DanNet/tree/feature/136-wn-lmf

Just so you don't duplicate my efforts. I suspect it'll be done by the weekend. I'll let you know so that you can beta-test the WN-LMF file and then it'll be part of the official dataset releases from then on.

simongray commented on July 23, 2024

@hallundbaek Please let me know if this works: dannet-wn-lmf.zip

hallundbaek commented on July 23, 2024

Great! I did not expect such a short turnaround on this, it is very much appreciated! Thanks!

I tried importing it using wn.add('path.to.file'), but it did not initially parse, apparently because goodmami/wn expects the first two lines to be exactly:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LexicalResource SYSTEM "http://globalwordnet.github.io/schemas/WN-LMF-1.1.dtd">

or

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LexicalResource SYSTEM "http://globalwordnet.github.io/schemas/WN-LMF-1.0.dtd">

The first line missing due to an improper parsing of xml on their part, and the second from a missing doctype declaration.

In any case, with those lines added it started importing the file.
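
For reference, adding those two lines can be scripted. This is only a minimal sketch: it assumes the 1.1 DTD is the right one for the file and that any existing XML declaration or DOCTYPE sits on its own line at the top. `add_wn_lmf_header` is a hypothetical helper name, not part of goodmami/wn.

```python
XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>'
# Assumption: the file conforms to WN-LMF 1.1; swap in the 1.0 DTD if not.
DOCTYPE = (
    '<!DOCTYPE LexicalResource SYSTEM '
    '"http://globalwordnet.github.io/schemas/WN-LMF-1.1.dtd">'
)


def add_wn_lmf_header(text: str) -> str:
    """Return `text` with the XML declaration and DOCTYPE as its first two lines."""
    lines = text.splitlines()
    i = 0
    # Skip blank lines and any declaration/DOCTYPE already at the top,
    # so that running the helper twice does not duplicate the header.
    while i < len(lines) and (
        not lines[i].strip()
        or lines[i].lstrip().startswith(("<?xml", "<!DOCTYPE"))
    ):
        i += 1
    return "\n".join([XML_DECL, DOCTYPE, *lines[i:]]) + "\n"
```

Reading the exported file as UTF-8 text, passing it through this function, and writing the result back should give a file whose first two lines match the first variant shown above.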

Unfortunately it failed when importing the <Synset> elements, specifically when they had a child <SynsetRelation> with a target key referring to a synset identifier that did not have its own <Synset> element.

To fix that issue I compiled a list of all ids referenced in target keys, and a second list of all <Synset> id values. I then took the difference of the two, which gave exactly those targets that did not have a corresponding <Synset> element. Using that difference I removed all of the <SynsetRelation>s that referred to non-existing <Synset>s.
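
Roughly, that clean-up could look like the sketch below. It is not the exact script I ran: it uses the standard library's ElementTree, and the function name `drop_dangling_synset_relations` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def drop_dangling_synset_relations(root: ET.Element) -> set:
    """Remove <SynsetRelation>s whose target has no <Synset>; return those targets."""
    # All ids that actually have a <Synset> element.
    synset_ids = {s.get("id") for s in root.iter("Synset")}
    # All ids referenced from a <SynsetRelation> target attribute.
    targets = {r.get("target") for r in root.iter("SynsetRelation")}
    # Referenced but never defined.
    missing = targets - synset_ids

    # Materialise the synset list first so the tree is not mutated mid-iteration.
    for synset in list(root.iter("Synset")):
        for rel in list(synset.findall("SynsetRelation")):
            if rel.get("target") in missing:
                synset.remove(rel)
    return missing
```

Usage would be something like `root = ET.parse(path).getroot()` followed by `missing = drop_dangling_synset_relations(root)`. Note that writing the tree back out with ElementTree does not keep the DOCTYPE line, so it has to be added again afterwards.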

At that point I got the XML file imported! And from a few queries through the Python interface, it seems to work!

I've uploaded the xml file that I got imported, alongside the list of synset ids that were only referenced in <SynsetRelation> targets.

dannet-goodmami-wn-compat.xml.gz
unrefd-syns.txt

Though ideally, the missing referenced <Synset>s would be nice to have! From testing a few of them on https://wordnet.dk/dannet/data/<id> they did show up, indicating that they somehow get lost on export to WN-LMF.

On another note: goodmami/wn does not support loading .zip files but only .gz files or raw .xml files, as such it would be preferable if the official release was either .gz or .xml.

Once it is available for release, it would be good to make a PR for goodmami/wn such that DanNet can be listed as an officially supported wordnet there, making it much easier to import while creating awareness of its compatibility. I'd be happy to offer my assistance if needed.

simongray commented on July 23, 2024

Thank you for that excellent feedback. I'll take a look at it next time I'm at work.

> Great! I did not expect such a short turnaround on this, it is very much appreciated! Thanks!

No problem! Let's say that it's the data transformation magic of Clojure/LISP combined with the fact that most of my colleagues are away at various conferences.

simongray commented on July 23, 2024

> The first line missing due to an improper parsing of xml on their part, and the second from a missing doctype declaration.
>
> In any case, with those lines added it started importing the file.

Thanks, that's good to know.

> Though ideally, the missing referenced <Synset>s would be nice to have! From testing a few of them on https://wordnet.dk/dannet/data/<id> they did show up, indicating that they somehow get lost on export to WN-LMF.

I see.

Yeah, I'm not sure what's going on here. Let me investigate this further. It's probably some logic error in my SPARQL query.

> On another note: goodmami/wn does not support loading .zip files but only .gz files or raw .xml files, as such it would be preferable if the official release was either .gz or .xml.

Sure... though it's a single file so decompressing it before use is surely not a huge obstacle? All of the datasets we have available for download are zipped as they would be much larger downloads otherwise.

Maybe it makes sense to make that file .gz. I'll have to think about it.

> Once it is available for release, it would be good to make a PR for goodmami/wn such that DanNet can be listed as an officially supported wordnet there, making it much easier to import while creating awareness of its compatibility. I'd be happy to offer my assistance if needed.

Yes, definitely!

simongray commented on July 23, 2024

Try this one: dannet-wn-lmf.zip

Also, can you please share how you're loading these files in Python using the wn library? It would help me to debug on my end.

simongray commented on July 23, 2024

dannet-wn-lmf.zip

I've tested the following file and made sure that the XML file contained in it can be opened in goodmami/wn.

hallundbaek commented on July 23, 2024

> Also, can you please share how you're loading these files in Python using the wn library? It would help me to debug on my end.

Apologies for not getting back to you on this, but I'm happy you got it working.

I've also tried it and I can confirm it works! Thanks a bunch!

> Sure... though it's a single file so decompressing it before use is surely not a huge obstacle? All of the datasets we have available for download are zipped as they would be much larger downloads otherwise.
>
> Maybe it makes sense to make that file .gz. I'll have to think about it.

Yeah it wouldn't be a huge obstacle, but it would make it impossible to add to the index at goodmami/wn, since it does not support zip files. This would make it less discoverable through their documentation, in the worst case leading users to conclude that only the omw DanNet is supported, and in the best case they would have to dig further into the documentation, figure out you can import files, look up DanNet, download and unzip the zip file, and then load the file.

If zip is preferred for keeping the available download formats homogeneous, I would suggest maintaining both zip and a gz, just for compatibility with goodmami/wn.
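
In case it helps, producing the .gz alongside the existing .zip should only take a few lines. A rough sketch using just the Python standard library; `zip_member_to_gz` and the assumption that the archive holds a single .xml file are mine, not something taken from the DanNet release tooling:

```python
import gzip
import shutil
import zipfile


def zip_member_to_gz(zip_path, gz_path, member=None):
    """Recompress one file from a .zip archive as a standalone .gz file.

    If `member` is not given, the first entry ending in .xml is used
    (assumption: the archive contains exactly one WN-LMF XML file).
    Returns the name of the archive member that was recompressed.
    """
    with zipfile.ZipFile(zip_path) as zf:
        name = member or next(n for n in zf.namelist() if n.endswith(".xml"))
        # Stream the bytes across so the whole file is never held in memory.
        with zf.open(name) as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return name
```

Calling it with the downloaded zip and an output path ending in .xml.gz would give a file in one of the two formats that goodmami/wn accepts.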

simongray commented on July 23, 2024

@hallundbaek The gzip WN-LMF dataset is included in the latest release: https://github.com/kuhumcst/DanNet/releases/tag/v2024-06-12
