
Python library for handling TEI EpiDoc files

License: MIT License


PyEpiDoc (α)

PyEpiDoc is a Python (>=3.9) library for parsing and interacting with TEI XML EpiDoc files. It has been tested on Python 3.9 on Linux (Ubuntu). It should work on later Python versions.

PyEpiDoc has been designed for use, in the first instance, with the I.Sicily corpus. For information on the encoding of I.Sicily texts in TEI EpiDoc, see the I.Sicily GitHub wiki.

NB: PyEpiDoc is currently under active development.

Install (no dev dependencies)

Locally

To install PyEpiDoc along with its dependencies (lxml):

  1. Clone or download the repository.

  2. Navigate into the cloned / downloaded repository.

  3. From within the cloned repository, install at the user level with:

    pip install . --user
    

In a virtual environment

If you are using a venv virtual environment:

  1. Make sure the virtual environment has been activated, e.g. on Linux:

    source env/bin/activate

  2. Install with pip:

    pip install .
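
After installing, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of PyEpiDoc.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on PyEpiDoc and its runtime dependency for the current interpreter.
for name in ("pyepidoc", "lxml"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If `pyepidoc` is reported as not installed inside a virtual environment, the environment has probably not been activated.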

Uninstall

pip uninstall pyepidoc

Install for development

To install PyEpiDoc along with its dependencies (lxml) and development dependencies (pytest, mypy), e.g. in a virtual environment:

  1. Clone or download the repository.

  2. Navigate into the cloned / downloaded repository.

  3. From within the cloned repository, install with:

    pip install .[dev]
    

Running the Jupyter Notebooks

A couple of Jupyter notebooks are included in the repository to provide example usage:

  • getting_started.ipynb
  • abbreviations.ipynb

For instructions on installing Jupyter notebook, see https://docs.jupyter.org/en/latest/install/notebook-classic.html. Alternatively, see also https://jupyter.org/install.

Once Jupyter notebook is installed, to run getting_started.ipynb, type:

jupyter notebook getting_started.ipynb

Example usage

Given a tokenized EpiDoc file ISic000001_tokenized.xml in an examples/ folder in the current working directory:

Load the EpiDoc file

from pyepidoc import EpiDoc

doc = EpiDoc("examples/ISic000001_tokenized.xml")

Print the text of the edition

print(doc.edition_text)

Print all tokens in an edition (e.g. <w>, <name> etc.)

tokens = doc.tokens
print(' '.join([str(token) for token in tokens]))
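
To make concrete what "tokens" means here, the sketch below collects the text of `<w>` and `<name>` elements from a tiny hand-written TEI fragment using only the standard library. It illustrates the idea and is not how PyEpiDoc is implemented; the sample fragment, `TOKEN_TAGS` and `token_strings` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: a minimal tokenized <ab> element.
SAMPLE = f"""
<ab xmlns="{TEI_NS}">
  <w>dis</w> <w>manibus</w> <name>Marcus</name> <w>vixit</w>
</ab>
"""

# Assumption: only a small subset of the elements that can carry tokens.
TOKEN_TAGS = {"w", "name"}


def token_strings(xml_text: str) -> list[str]:
    """Return the text content of each token element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for elem in root.iter():
        local_name = elem.tag.split("}")[-1]  # drop the namespace prefix
        if local_name in TOKEN_TAGS:
            tokens.append("".join(elem.itertext()).strip())
    return tokens


print(" ".join(token_strings(SAMPLE)))
```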

Produce a tokenized version of a given EpiDoc file

Given an untokenized EpiDoc file ISic000032_untokenized.xml in an examples folder in the current working directory:

from pyepidoc import EpiDoc

# Load the EpiDoc file
doc = EpiDoc("examples/ISic000032_untokenized.xml")

# Tokenize the edition with default settings
doc.tokenize()

# Print list of tokens
print('Tokens: ', doc.tokens_list_str)

# Save the results to a new XML file
doc.to_xml_file("examples/ISic000032_tokenized.xml")
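
As a rough picture of what tokenization produces, the following standard-library sketch wraps each whitespace-separated word of a plain string in a `<w>` element. This is a deliberately simplified illustration, not PyEpiDoc's tokenizer, which operates on existing EpiDoc markup; the function name `wrap_words` is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_words(ab_text: str) -> ET.Element:
    """Build an <ab> element in which every whitespace-separated word is a <w>."""
    ab = ET.Element("ab")
    for word in ab_text.split():
        w = ET.SubElement(ab, "w")
        w.text = word
        w.tail = " "  # keep a space between neighbouring tokens
    if len(ab):
        ab[-1].tail = None  # no trailing space after the last token
    return ab


ab = wrap_words("dis manibus sacrum")
print(ET.tostring(ab, encoding="unicode"))
# prints: <ab><w>dis</w> <w>manibus</w> <w>sacrum</w></ab>
```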

Corpus level analysis

Given a corpus of EpiDoc XML files in a folder corpus/ in the current working directory, the following code filters the corpus and writes a text file containing the ids of all Latin funerary inscriptions from Catania / Catina:

from pyepidoc import EpiDocCorpus
from pyepidoc.epidoc.enums import TextClass
from pyepidoc.file.funcs import str_to_file

# Load the corpus
corpus = EpiDocCorpus('corpus')

# Filter the corpus to find the funerary inscriptions
funerary_corpus = corpus.filter_by_textclass([TextClass.Funerary.value])

# Within the funerary corpus, find all the Latin inscriptions from Catania / Catina:
catina_funerary_corpus = (
    funerary_corpus
        .filter_by_orig_place(['Catina'])
        .filter_by_languages(['la'])
)

# Output the ids of this set of documents to a file catina_funerary_ids_la.txt
# in the current working directory.
catina_funerary_ids = '\n'.join(catina_funerary_corpus.ids)
str_to_file(catina_funerary_ids, 'catina_funerary_ids_la.txt')
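
The chained filtering above can be pictured with plain Python objects. The sketch below is an analogy only: `Record`, the three filter functions, the sample ids and the text class strings are all hypothetical, and the "any value matches" semantics is an assumption rather than a statement about how `EpiDocCorpus` behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for the metadata of one EpiDoc file."""
    id: str
    textclasses: tuple[str, ...]
    orig_place: str
    languages: tuple[str, ...]


def by_textclass(records: list[Record], wanted: list[str]) -> list[Record]:
    # Assumption: a record matches if any of its text classes is wanted.
    return [r for r in records if set(r.textclasses) & set(wanted)]


def by_orig_place(records: list[Record], places: list[str]) -> list[Record]:
    return [r for r in records if r.orig_place in places]


def by_languages(records: list[Record], langs: list[str]) -> list[Record]:
    # Assumption: a record matches if any of its languages is wanted.
    return [r for r in records if set(r.languages) & set(langs)]


# Invented sample data, not taken from any real corpus.
records = [
    Record("DOC001", ("funerary",), "Catina", ("la",)),
    Record("DOC002", ("funerary",), "Catina", ("grc",)),
    Record("DOC003", ("honorific",), "Catina", ("la",)),
    Record("DOC004", ("funerary",), "Syracusae", ("la",)),
    Record("DOC005", ("funerary",), "Catina", ("la", "grc")),
]

funerary = by_textclass(records, ["funerary"])
selected = by_languages(by_orig_place(funerary, ["Catina"]), ["la"])
print("\n".join(r.id for r in selected))
```

Each filter returns a new, smaller collection, which is why the calls can be applied one after another in any order.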

Validate EpiDoc XML

There are two ways to validate an EpiDoc XML file:

  1. Validate on load, e.g.:

    from pyepidoc import EpiDoc

    doc = EpiDoc('examples/ISic000001_tokenized.xml', validate_on_load=True)

     • This validates according to the RelaxNG schema tei-epidoc.rng in the pyepidoc root directory.
     • By default validate_on_load is set to False.

  2. Validate against a custom RelaxNG schema:

    from pyepidoc import EpiDoc

    doc = EpiDoc('examples/ISic000001_tokenized.xml')

    doc.validate_by_relaxng(fp='path/to/relaxngschema.rng')

Code organisation

Package structure

The PyEpiDoc package has four subpackages:

  • xml containing modules with base classes for XML handling;
  • epidoc containing modules for EpiDoc-specific XML handling, e.g. of <ab>, <w> etc.;
  • analysis containing modules for analysing EpiDoc files and corpora, e.g. of abbreviations;
  • shared containing modules and classes for use generally in the project.

Probably the most useful subpackage in the first instance will be epidoc, and in particular epidoc.py and corpus.py, which, via the classes EpiDoc and EpiDocCorpus, provide APIs to EpiDoc files and corpora respectively.

Modifying tokenizer behaviour

The treatment of a given token by the tokenizer may be affected by one or more of the following:

  • Status in pyepidoc/epidoc/epidoctypes.py
  • Presence in pyepidoc/constants.py in SubsumableRels

The token will be subsumed into a neighbouring <w> token, provided no whitespace separates the two, if:

  • it is listed as a dep of e.g. <w> in SubsumableRels

The token will be subsumed into a neighbouring <w> token regardless of the presence of intervening whitespace if:

  • it is listed as a dep of e.g. <w> in SubsumableRels and
  • it is a member of AlwaysSubsumableType in epidoctypes.py
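
The two rules above can be expressed as a small decision function. This is a sketch of the logic as described in this section, not the code in pyepidoc/constants.py or epidoctypes.py: the data structures `SUBSUMABLE_RELS` and `ALWAYS_SUBSUMABLE`, their contents, and the function `is_subsumed` are hypothetical stand-ins.

```python
# Hypothetical stand-in for SubsumableRels: each entry names a head element
# and a dependent element that may be subsumed into it. Values are invented.
SUBSUMABLE_RELS = [
    {"head": "w", "dep": "lb"},
    {"head": "w", "dep": "hi"},
]

# Hypothetical stand-in for AlwaysSubsumableType. Values are invented.
ALWAYS_SUBSUMABLE = {"lb"}


def is_subsumed(dep_tag: str, head_tag: str, separated_by_whitespace: bool) -> bool:
    """Decide whether `dep_tag` is subsumed into a neighbouring `head_tag` token."""
    listed = any(
        rel["head"] == head_tag and rel["dep"] == dep_tag
        for rel in SUBSUMABLE_RELS
    )
    if not listed:
        return False
    if not separated_by_whitespace:
        return True  # first rule: listed and directly adjacent
    return dep_tag in ALWAYS_SUBSUMABLE  # second rule: whitespace is ignored


print(is_subsumed("hi", "w", separated_by_whitespace=False))  # True
print(is_subsumed("hi", "w", separated_by_whitespace=True))   # False
print(is_subsumed("lb", "w", separated_by_whitespace=True))   # True
```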

Code integrity

Run the tests

With pytest installed (the dev installation will do this for you):

  1. Navigate to the tests/ folder.

  2. To run all the tests:

    pytest
    

If pytest is not available to the currently active version of Python, it may be necessary to specify the Python executable with pytest installed, e.g.:

```
python3.10 -m pytest
```

Check the types

To check the integrity of the type annotations, with mypy installed (the dev installation will do this for you):

mypy src/pyepidoc

If mypy is not available to the currently active version of Python, it may be necessary to specify the Python executable with mypy installed, e.g.:

```
python3.10 -m mypy src/pyepidoc
```
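
The `python -m` form used above for pytest and mypy can also be built from within Python, which guarantees that the tool runs under the same interpreter as the calling script. A minimal sketch follows; `module_command` and `run_tool` are hypothetical helper names, and `run_tool` is defined but not called here so that the example has no side effects.

```python
import subprocess
import sys


def module_command(module: str, *args: str) -> list[str]:
    """Build a command that runs `module` with the current Python interpreter."""
    return [sys.executable, "-m", module, *args]


def run_tool(module: str, *args: str) -> int:
    """Run the tool and return its exit code (0 conventionally means success)."""
    return subprocess.run(module_command(module, *args), check=False).returncode


# Show the commands that would be run, without executing them.
print(" ".join(module_command("pytest")))
print(" ".join(module_command("mypy", "src/pyepidoc")))
```

For example, calling `run_tool("pytest")` from the tests/ folder, or `run_tool("mypy", "src/pyepidoc")` from the repository root, mirrors the commands given in this section.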

Features to be included in future

XML comments

XML comments should now be handled correctly, and reproduced in new files.

Dependencies

PyEpiDoc depends on lxml (BSD). Development dependencies are mypy (MIT) and pytest (MIT). Licenses for these dependencies are included in the LICENSES directory.

Acknowledgements

The software for PyEpiDoc was written by Robert Crellin as part of the Crossreads project at the Faculty of Classics, University of Oxford, and is licensed under the BSD 3-clause license. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 885040, “Crossreads”).

Example and test .xml files, contained in the examples/ and tests/ subfolders, as well as in the source code, are either taken directly from, or derived from, the I.Sicily corpus, which is made available under the CC-BY-4.0 licence.

For further details and acknowledgements on the generation of ISicily token IDs (pyepidoc/epidoc/ids), see https://github.com/rsdc2/ISicID.
