ryderwishart / ancient-greek-word2vec

Global vector modelling notebooks for Ancient Greek

License: MIT License

ancient-greek greek-new-testament hellenistic machine-learning nlp word2vec

Ancient Greek Word2Vec

This repository contains Jupyter notebooks for training and querying word2vec models for Ancient Greek.

Dependencies

You will need Gensim to use the notebooks in this repo. Both notebooks attempt to install it if it is missing:

try:
    from collections.abc import Mapping
    from gensim.models.word2vec import Word2Vec
except ImportError:
    # Gensim is missing; install it, then retry the imports.
    !pip install -I gensim
    from collections.abc import Mapping
    from gensim.models.word2vec import Word2Vec

Because the notebooks may install packages, you should therefore ensure you are working in a virtual environment managed with Conda or a similar tool before running them.
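A minimal setup sketch, assuming Conda (the environment name and Python version here are arbitrary choices, not part of the repo):

```shell
# Hypothetical environment name; any name works.
conda create -n greek-w2v python=3.10 -y
conda activate greek-w2v
pip install gensim
```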

Get most similar terms

The notebook query_w2v_model.ipynb contains code to load an existing model and query it. Once the code has run through successfully, you can query additional terms by imitating the example code.

For example, querying 'ἀρχιερεύς' ('high priest') using model.most_similar('ἀρχιερεύς', topn=10) generates the following results:

[('ἱερεύς', 0.6741834878921509), # 'priest' is 67% similar
 ('πατριάρχης', 0.6512661576271057), # 'patriarch' is 65% similar
 ('διάκονος', 0.6433295011520386), # 'deacon'/'servant' is 64% similar
 ('ἰούδας', 0.6175312399864197), # 'Jude'/'Judas' is 61% similar, etc.
 ('ἀρχιερωσύνη', 0.6078990697860718),
 ('χρηματίζω', 0.5876879096031189),
 ('χειροτονέω', 0.577594518661499),
 ('ἀαρών', 0.5753183364868164),
 ('πιλᾶτος', 0.572640061378479),
 ('ἀναγορεύω', 0.5716601610183716)]
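The scores in these results are cosine similarities between word vectors, ranging from -1 to 1. A minimal pure-Python sketch of the computation (the toy vectors below are illustrative; real vectors come from the trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (illustrative only; real model vectors
# have hundreds of dimensions).
priest = [0.9, 0.1, 0.3]
high_priest = [0.8, 0.2, 0.4]
print(cosine_similarity(priest, high_priest))
```

Identical directions score 1.0, orthogonal vectors 0.0, so a score of 0.67 indicates substantial distributional similarity.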

Train your own model

The notebook build_greek_w2v_model.ipynb contains code to train your own model, including specifying a data directory and setting hyperparameters such as vector size, context window size, etc.

You can also change the neural network type (either skip-gram or continuous bag-of-words). Anecdotally, I find CBOW works better for lemmatized comparisons. For example, for the term λόγος ('word'/'reason'/'account'/etc.), compare the following results:

# Skip-gram

[('ἀξιομνημόνευτος', 0.6042230129241943),
 ('ἐφεκτέον', 0.6000658869743347),
 ('ψευδοδοξία', 0.589237630367279),
 ('λογοποιία', 0.5858883261680603),
 ('ἐξεταστικός', 0.5815737247467041),
 ('ἀκριβολογία', 0.5795413851737976),
 ('ὑφήγησις', 0.5784241557121277),
 ('ἀναμφίλεκτος', 0.5765267014503479),
 ('ἰσοσθένεια', 0.5729403495788574),
 ('εἰσαγωγικός', 0.5722008347511292)]

# CBOW

[('ἑρμηνεία', 0.5288925766944885),
 ('ἐξήγησις', 0.5065293312072754),
 ('θεολογία', 0.4956413805484772),
 ('διδασκαλία', 0.48970848321914673),
 ('ὑπόληψις', 0.47546130418777466),
 ('διήγησις', 0.47397667169570923),
 ('σκέψις', 0.47282832860946655),
 ('θεωρία', 0.47240814566612244),
 ('διαλέγω', 0.4657094478607178),
 ('φυσιολογία', 0.4602492153644562)]

The skip-gram approach may work better with a higher min_count_input (10? 100?) to exclude rare lemmas.

Theoretically, you could use just about any language data in plaintext format as input. The data lives in the data/corpus subdirectory and consists of a set of plaintext files ending in the .txt extension.

The input data format is one sentence per line.
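For example, a corpus file might look like this (hypothetical lemmatized lines, one sentence each):

```
ἐν ἀρχή εἰμί ὁ λόγος
καί ὁ λόγος εἰμί πρός ὁ θεός
```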

Corpora

The data included in this repository is lemmatized, with 19,053,248 lemmas in data/corpus (check using cat data/corpus/*.txt | wc -w in the repository root).

The files under data/papyri include 1,759,488 lemmas (check using find data/papyri -type f -name "*.txt" -exec cat {} + | wc -w to avoid the 'argument list too long' error).

This data has been extracted from LemmatizedAncientGreekXML and MALP. Thank you, Giuseppe!

TODO: There are some zero-byte files in the corpus. Presumably, no lemmas were extracted for these files. This should be investigated in order to increase the corpus size.
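One way to locate the zero-byte files, assuming a standard find is available:

```shell
# List empty .txt files in the corpus (run from the repository root).
find data/corpus -type f -name "*.txt" -size 0
```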

TODO: Add ngram detection preprocessing option

TODO: Add FastText sub-word model

ancient-greek-word2vec's Issues

Addition

Hi!

I tried your model because mine is not very good at semantic addition, like the classic word2vec example king + woman - man = queen.

So I tried:
βασιλεύς + γυνή - ἀνήρ = γνόφον σπέρματί εἰσελεύσῃ σπεύσουσιν αἴτησαι κατασφάζω θυγάτηρ παῖς
and with yours:
βασιλεύς + γυνή - ἀνήρ = βασιλίζω μαρδοχαῖος ἐζεκίας ἀδελφή ἀρταξέρξης γαμετή καλλιρρόη ἡρῴδης

I chose a window of 8. Did you succeed with vector addition?

Yours,
My repo is https://github.com/l0d0v1c/Ancient-greek-word2vec

xml to txt

Dear Ryder Wishart!

I came across your repository while searching for code to parse the XML files of the First1KGreek.

You write in the README that you got the XML files from LemmatizedAncientGreekXML and MALP.

If it is okay to ask: would you share the code for converting the XML to .txt with me, or show me where to find it?
(I have struggled for some days now to parse a few texts (Polycarp's Martyrdom and the Acts of John, Philip, Barnabas, and Thomas), but so much noise remains in the text.)

For the record, I think your projects look really interesting, and I would like to connect at some point. I am in Denmark (Europe), trying to figure out how to implement NLP techniques in my regular teaching :-)

Best regards,
Christian
