karlb / wikdict-gen

Generation of bilingual dictionaries from Wiktionary/dbnary data for the WikDict project

Home Page: http://www.wikdict.com

License: MIT License

Makefile 38.45% Shell 0.26% Python 61.29%
dictionary freedict tei translation sparql wiktionary sqlite free open-data

wikdict-gen's Introduction

WikDict dictionary generator

This generator extracts data from a Virtuoso database filled with dbnary data. Details on how to set up such a database can be found in the virtuoso directory. The extracted data is then used to generate WikDict dictionaries, which are mainly used on the WikDict website.

Usage

After setting up the Virtuoso database, run

git clone git@github.com:karlb/wikdict-gen.git
cd wikdict-gen
make

You can use the resulting dictionaries in dictionaries/wdweb with wikdict-web, try a quick lookup with the search command, e.g.

src/run.py search de en haus

or use the dictionaries in dictionaries/generic for any other use case.

Support

If you encounter problems when building or using dictionaries, please submit an issue or contact [email protected].

wikdict-gen's People

Contributors

karlb

wikdict-gen's Issues

Bad pronunciations for en:live

sqlite3 processed/en.sqlite3
sqlite> SELECT * FROM entry WHERE lexentry LIKE 'eng/live\_\_%' ESCAPE '\';
lexentry                vocable   written_rep  part_of_speech  gender  pronun_list
----------------------  --------  -----------  --------------  ------  --------------
eng/live__Adjective__1  eng/live  live         adjective       (null)  /lɪv/
eng/live__Adverb__1     eng/live  live         adverb          (null)  /lɪv/
eng/live__Verb__1       eng/live  live         verb            (null)  /laɪv/ | /lɪv/

It should be

Questionable form "liveed" of en:live

sqlite3 processed/en.sqlite3
sqlite> SELECT * FROM form WHERE lexentry LIKE 'eng/live\_\_%' ESCAPE '\' AND other_written_full = 'liveed';
lexentry           other_written_full  pos   rank  number  mood        person  tense  voice   case    definiteness  inflection  other_written
-----------------  ------------------  ----  ----  ------  ----------  ------  -----  ------  ------  ------------  ----------  -------------
eng/live__Verb__1  liveed              verb  4     (null)  Participle  (null)  Past   (null)  (null)  (null)        (null)      liveed
eng/live__Verb__1  liveed              verb  3     (null)  (null)      (null)  Past   (null)  (null)  (null)        (null)      liveed

Where does this form come from? I don't see it on https://en.wiktionary.org/wiki/live
Problem visible on https://www.wikdict.com/de-en/live

sdcv can't read some files (StarDict format)

I wanted to install dictionaries on a Kindle PW with KOReader. I did, but the e-reader didn't find any words. Then I installed the PC version of KOReader. It didn't work either. But I found out that this problem ("nitems == 1 isn't true") occurs only with WikDict dictionaries. I looked at the source code of sdcv: sdcv just can't read some of the files; fread returns 0. I tried to generate the files myself, but "make insert" gets stuck after it creates "bg.inserted". I know nothing about Docker, Virtuoso, or SQL, so that's not a problem I can solve myself.

Adding inflections

Hello,

Thank you very much for developing this cool project! I have been working on something similar, based not on DBnary but on the Wiktextract project. Compared to your project, I only have [Language]-English dictionaries, but it occurred to me that you could improve your dictionaries with very little code by adding the inflection data from kaikki.org. In my project I do some post-processing that is still very much a work in progress, so in theory you could also take my inflection data from the published TSVs (in some cases, such as Spanish, they are a clear improvement; in others they are likely still a bit buggy).

Have a great day!

Save Wiktionary categories

Thank you for the nice dictionaries, I really appreciate your work!

I'm building a Telegram bot for learning languages and I find wikdict dictionaries very helpful.

However, I think it would be better to preserve the categories in the databases, so that users can select only the words from certain categories.

Docs on sql structure?

Hi there, thanks so much for providing these dictionaries!

I'm trying to build (yet another) flash card app, and I was hoping to make use of the sqlite dictionaries. One goal is to focus primarily on the most commonly used words in a language, and I'm wondering if any of the data provided in the dictionary covers this. I see score and importance, but when I order by those, the results don't quite match what I would expect to see as the most commonly used words. Is there any documentation of what each of the fields and tables represents? I'd also like to understand what the other columns represent.
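Until such documentation exists, the schema can at least be inspected directly. The following is a minimal sketch using only Python's standard sqlite3 module. The in-memory table is a hypothetical stand-in (its column names are illustrative, not the documented schema); in practice you would connect to a downloaded dictionary file instead.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return a mapping of table name -> list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Hypothetical stand-in for a real dictionary; replace with something like
# sqlite3.connect("en-el.sqlite3") to inspect a downloaded file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE translation (lexentry TEXT, written_rep TEXT, score REAL)"
)
schema = describe_schema(conn)
for table, columns in schema.items():
    print(table, columns)
```

This only lists what is there; it does not explain how score or importance are computed.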

StarDict files missing words

I tested some dictionaries from WikDict and I think there's something wrong with the StarDict files. For example, I downloaded all dictionaries involving {de,en,it,pt}. On the website, searching for "potato" yields results in all three other languages, but in the downloaded dictionaries only en-pt had a result:

$ sdcv -lx2 .
Dictionary's name   Word count
português-Deutsch FreeDict+WikDict dictionary (pt-de)    8773
Deutsch-italiano FreeDict+WikDict dictionary (de-it)    27580
português-italiano FreeDict+WikDict dictionary (pt-it)    8693
português-English FreeDict+WikDict dictionary (pt-en)    12635
italiano-English FreeDict+WikDict dictionary (it-en)    22614
italiano-Deutsch FreeDict+WikDict dictionary (it-de)    10367
italiano-português FreeDict+WikDict dictionary (it-pt)    10296
English-português FreeDict+WikDict dictionary (en-pt)    47632
English-italiano FreeDict+WikDict dictionary (en-it)    44403
English-Deutsch FreeDict+WikDict dictionary (en-de)    55852
Deutsch-English FreeDict+WikDict dictionary (de-en)    54541
Deutsch-português FreeDict+WikDict dictionary (de-pt)    14814
$ sdcv -ex2 . potato
Found 1 items, similar to potato.
-->English-português FreeDict+WikDict dictionary (en-pt)
-->potato

<div><i>noun</i><br><font color="green">//pəˈteɪ.toʊ//</font>, <font color="green">//pəˈteɪ.təʊ//</font><br>
batata, batatinha — plant tuber eaten as starchy vegetable</div>

I had similar results with some other common words.

Also, the StarDict files haven't been updated since April 2021; maybe simply regenerating them would fix it.
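One way to check whether a headword such as "potato" is really absent from a StarDict file, independently of sdcv, is to read the .idx file directly. The sketch below assumes the common StarDict layout with 32-bit offsets (no idxoffsetbits=64 line in the .ifo file) and an uncompressed index. The helper build_idx and the sample entries are made up for the demonstration.

```python
import struct


def read_idx_headwords(data: bytes) -> list[str]:
    """Parse the bytes of an uncompressed StarDict .idx file (32-bit offsets)."""
    words = []
    pos = 0
    while pos < len(data):
        # Each entry starts with a NUL-terminated UTF-8 headword ...
        end = data.index(b"\x00", pos)
        words.append(data[pos:end].decode("utf-8"))
        # ... followed by a 4-byte offset and a 4-byte size (big-endian).
        pos = end + 1 + 8
    return words


def build_idx(entries: list[tuple[str, int, int]]) -> bytes:
    """Build .idx bytes from (headword, offset, size) tuples, for the demo only."""
    return b"".join(
        word.encode("utf-8") + b"\x00" + struct.pack(">II", offset, size)
        for word, offset, size in entries
    )


# Made-up sample data standing in for a real index file.
sample = build_idx([("batata", 0, 12), ("potato", 12, 30)])
headwords = read_idx_headwords(sample)
print(len(headwords), "potato" in headwords)  # prints "2 True"
```

For a real dictionary, pass the contents of the .idx file, e.g. read_idx_headwords(open(path, "rb").read()); if the index ships as .idx.gz, decompress it with the gzip module first. The resulting count can then be compared with the word count reported by sdcv -l.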

"light" missing in en->el

I would expect results for the following query, but there are none:

sqlite3 en-el.sqlite3
sqlite> SELECT * FROM translation WHERE lexentry LIKE 'eng/light\_\_%' ESCAPE '\' LIMIT 20;

There are plenty of translations to Greek in https://en.wiktionary.org/wiki/light which should show up.
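As a side note on the query itself: the ESCAPE clause matters because `_` is a single-character wildcard in LIKE, so an unescaped pattern would also match lexentries of other words. The sketch below demonstrates this with Python's sqlite3 module on made-up rows; the table contents are hypothetical and only mimic the lexentry naming seen in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (lexentry TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?)",
    [
        ("eng/light__Noun__1",),
        ("eng/light__Verb__1",),
        ("eng/lighten__Verb__1",),
        ("eng/light-year__Noun__1",),
    ],
)

# Unescaped: each "_" matches any single character, so "lighten" and
# "light-year" are matched as well.
unescaped = [
    row[0]
    for row in conn.execute(
        "SELECT lexentry FROM translation "
        "WHERE lexentry LIKE 'eng/light__%' ORDER BY lexentry"
    )
]

# Escaped: "\_" matches a literal underscore only.
escaped = [
    row[0]
    for row in conn.execute(
        "SELECT lexentry FROM translation "
        "WHERE lexentry LIKE ? ESCAPE '\\' ORDER BY lexentry",
        (r"eng/light\_\_%",),
    )
]

print(len(unescaped), len(escaped))  # prints "4 2"
```

So the escaped query is the right one to use, and an empty result really means that no lexentry for "light" has translations in that file.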

"pomme" not in typeahead

When typing "pomme" in de-fr, the French word "pomme" should be in the typeahead suggestions. It is probably not included due to suboptimal sorting of the typeahead candidates.
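To illustrate how sorting alone can push a word out of a short suggestion list, here is a minimal sketch of one possible ranking that puts exact matches first. It is not the ordering wikdict-web actually uses; the Candidate class, the importance values, and the limit are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    word: str
    lang: str
    importance: float  # hypothetical relevance score


def rank_typeahead(
    query: str, candidates: list[Candidate], limit: int = 3
) -> list[Candidate]:
    """Return up to `limit` prefix matches, exact matches first."""
    q = query.casefold()
    matches = [c for c in candidates if c.word.casefold().startswith(q)]
    # Exact matches sort first (False < True), then by descending importance,
    # then alphabetically to keep the order deterministic.
    matches.sort(key=lambda c: (c.word.casefold() != q, -c.importance, c.word))
    return matches[:limit]


# Made-up candidates: sorted by importance alone, "pomme" would be cut off.
candidates = [
    Candidate("Pommes frites", "de", 0.9),
    Candidate("Pommes", "de", 0.8),
    Candidate("pommeau", "fr", 0.7),
    Candidate("pomme", "fr", 0.2),
]
top = rank_typeahead("pomme", candidates)
print([c.word for c in top])  # prints ['pomme', 'Pommes frites', 'Pommes']
```

With a limit of three and importance as the only sort key, the exact match "pomme" would not appear in this example, which is the kind of behavior described above.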

Missing translations that are not assigned to a sense

Currently, the assumption is that for each Wiktionary, translations are either:

  • assigned to a sense or
  • assigned to the lexical entry, but they contain a gloss that describes the "sense" for the specific translations

Unfortunately, this does not allow handling of all translations. In the Spanish Wiktionary, translations are in a single section for the lexentry (so not directly assigned to a sense), but often contain a numeric reference (e.g. "[2]") to identify the sense. dbnary is smart enough to parse these and assign the translation to the sense in that case.

However, not all translations have these numeric sense references and therefore stay linked to the lexentry. These translations are currently lost to WikDict. It would make sense to include these translations with an empty sense/gloss.

Example:
https://es.wiktionary.org/wiki/monje

zcat ttl/es_dbnary_*.ttl.gz | awk 'BEGIN {RS=""} /_tr_.*monje/ {print "\n"$0}'

spa:__tr_deu_1_monje__sustantivo_masculino__1
        rdf:type                dbnary:Translation;
        dbnary:isTranslationOf  spa:monje__sustantivo_masculino__1;
        dbnary:targetLanguage   lexvo:deu;
        dbnary:writtenForm      "Mönch"@de .

spa:__tr_bre_1_monje__sustantivo_masculino__1
        rdf:type                dbnary:Translation;
        dbnary:isTranslationOf  spa:monje__sustantivo_masculino__1;
        dbnary:targetLanguage   lexvo:bre;
        dbnary:writtenForm      "manac'h"@br .

spa:__tr_eng_1_monje__sustantivo_masculino__1
        rdf:type                dbnary:Translation;
        dbnary:isTranslationOf  spa:monje__sustantivo_masculino__1;
        dbnary:targetLanguage   lexvo:eng;
        dbnary:writtenForm      "monk"@en .
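The proposed fallback could look roughly like the sketch below: prefer the sense, then the translation's own gloss, and otherwise keep the translation with an empty label instead of dropping it. The Translation class and the row layout are hypothetical and do not reflect the actual wikdict-gen code; the sample data is taken from the monje example above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Translation:
    lexentry: str
    target_lang: str
    written_form: str
    sense: Optional[str] = None  # sense the translation is assigned to, if any
    gloss: Optional[str] = None  # gloss attached to the translation itself


def translation_rows(
    translations: Iterable[Translation],
) -> Iterator[tuple[str, str, str, str]]:
    """Yield (lexentry, sense_or_gloss, target_lang, written_form) rows.

    Translations with neither a sense nor a gloss are kept with an empty
    label rather than being discarded.
    """
    for t in translations:
        label = t.sense or t.gloss or ""
        yield (t.lexentry, label, t.target_lang, t.written_form)


lexentry = "spa:monje__sustantivo_masculino__1"
rows = list(
    translation_rows(
        [
            Translation(lexentry, "deu", "Mönch"),
            Translation(lexentry, "bre", "manac'h"),
            Translation(lexentry, "eng", "monk"),
        ]
    )
)
for row in rows:
    print(row)
```

All three translations survive with an empty sense/gloss, so downstream code only needs to tolerate an empty label.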

Make the wdweb databases available for download?

Hey

I want to create an autocompletion integration between CodeMirror and the wikdict-database, i.e. use autocomplete to translate words.

From your code it seems like the translation_block table in the wdweb database would be the easiest way to go from an English word to a German translation + its 'senses'.

Would it be possible to make that database available for download?

Bad inferred translation Power (de) -> alimenter (fr)

sqlite> SELECT * FROM infer WHERE from_vocable='Power' AND to_vocable='alimenter';
from_lang   to_lang     lexentry                  sense_num   sense                   from_vocable  to_vocable  sources     source_details  score       from_importance   to_importance
----------  ----------  ------------------------  ----------  ----------------------  ------------  ----------  ----------  --------------  ----------  ----------------  -----------------
de          fr          deu/Power__Substantiv__1  01          Energie, Kraft, Stärke  Power         alimenter   indirect    en:power        1           0.43392278930237  0.905266136895853

It's not totally wrong, since power button (en) = bouton d'alimentation (fr). But with this sense, it is at least misleading.
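The "indirect" source with details "en:power" suggests the pair was inferred through English as a pivot language. The sketch below shows, under that assumption, how composing two translation maps through an ambiguous pivot word produces such a pair. It is an illustration only, not the inference code used by wikdict-gen, and the two small dictionaries are made-up data.

```python
def infer_via_pivot(
    src_to_pivot: dict[str, list[str]],
    pivot_to_dst: dict[str, list[str]],
    pivot_lang: str,
) -> list[tuple[str, str, str]]:
    """Compose two translation maps through a pivot language.

    Returns (from_vocable, to_vocable, source_details) tuples.
    """
    inferred = []
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            # Every translation of the pivot word is carried over, regardless
            # of which sense of the pivot word it belongs to.
            for dst_word in pivot_to_dst.get(pivot_word, []):
                inferred.append((src_word, dst_word, f"{pivot_lang}:{pivot_word}"))
    return inferred


# Made-up data: English "power" covers both the noun and the verb.
de_en = {"Power": ["power"]}
en_fr = {"power": ["puissance", "alimenter"]}
pairs = infer_via_pivot(de_en, en_fr, "en")
for pair in pairs:
    print(pair)
```

Because the pivot word "power" carries several senses, the German noun inherits the French verb "alimenter" along with the fitting "puissance".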
