
zeichenkette / dgd2cmdi


dgd metadata and resource conversion to clarin cmdi:

Home Page: http://fkuhn.github.io/dgd2cmdi/

License: BSD 3-Clause "New" or "Revised" License

Python 53.54% XSLT 46.46%

dgd2cmdi's Introduction

DGD2CMDI


About

Processing Metadata for the Archive of Spoken German.

Folder structure

config/

Contains an example resource configuration file.

doc/

Contains apidoc and further documentation.

saxon/

The xsl processor

src/

Contains the package's sources.

tests/

Contains some simple tests for travis and pytest.

xslt/

The stylesheets used for data conversion.

Installation and Setup

  1. Use python setup.py install to install the package.

  2. You will also need an XSLT processor (e.g. SAXON-HE):

wget https://sourceforge.net/projects/saxon/files/Saxon-HE/9.7/SaxonHE9-7-0-8J.zip

unzip -d saxon SaxonHE9-7-0-8J.zip

  3. The XSLT stylesheets must be present and referable.

  4. Create a YAML file for resource reference, e.g. "resources.yml". An example file can be found in config/.

  5. Define the resources to be processed using the YAML layout shown in the sample resource file in samples/

Usage

dgd2cmdi installs as a CLI command, dgd2cmdi, and can be run from the shell. It requires a configuration file (see Installation and Setup) as its first argument. For example:

dgd2cmdi resources_configuration_.yml

The program parses all resources, stores them in an intermediate representation, and adds dependencies in a final step. You can change where the files are written by editing the configuration file. The default path for the intermediate format is /tmp/intermediate_cmdi
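The intermediate step described above depends on the XSLT processor, but the README does not show how it is invoked. As a rough sketch only, assuming the processor is a Saxon-HE jar started with `java -jar` and Saxon's standard `-s:`, `-xsl:` and `-o:` command-line options, the command for one transformation could be assembled as follows. The helper name `build_saxon_command` and the file paths are hypothetical and not taken from dgd2cmdi.

```python
def build_saxon_command(processor, stylesheet, source, output):
    """Return the argument list for one Saxon transformation.

    Hypothetical helper: dgd2cmdi may call the processor differently.
    """
    return [
        "java", "-jar", str(processor),
        f"-s:{source}",        # source XML document
        f"-xsl:{stylesheet}",  # stylesheet to apply
        f"-o:{output}",        # file the result is written to
    ]


# Illustrative paths only; replace them with the values from your configuration.
command = build_saxon_command(
    processor="saxon/saxon.jar",
    stylesheet="path/to/corpus.xsl",
    source="path/to/corpus.xml",
    output="/tmp/intermediate_cmdi/corpus.cmdi",
)
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths that contain spaces.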

Configuration File Layout

There is a default configuration file named resources.yml located in the main directory. The configuration file uses the YAML format and is structured as follows:


# xslt processor and stylesheet path declaration
processor: "saxon/saxon.jar"
# add the path to the xsl stylesheets
stylesheets:
    corpus: "path/to/corpus.xsl"
    event: "path/to/event.xsl"
    speaker: "path/to/speaker.xsl"

# output paths declarations
output-inter: "path/to/intermediate/transformation/output"
output-final: "path/to/finalized/output"

# resource collection declaration
collection:
    PF:
        corpus: "path/to/corpus/catalogue/file/corpus.xml" # the catalogue file
        event: "path/to/events/"  # the directory containing all of the corpus' events
        speaker: "path/to/speakers" # the directory containing all of the corpus' speakers

# add more corpora here following the convention:
#     CORPUSLABEL:
#         corpus:
#         event:
#         speaker:
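Before running a conversion it can help to check that a configuration has the keys shown in the layout above. The following is a minimal sketch, not the package's own loader: `load_config` and `validate_config` are hypothetical names, and `load_config` assumes the third-party PyYAML package is installed.

```python
REQUIRED_TOP_LEVEL = ("processor", "stylesheets", "output-inter",
                      "output-final", "collection")
RESOURCE_TYPES = ("corpus", "event", "speaker")


def load_config(path):
    """Parse the YAML file; requires the third-party PyYAML package."""
    import yaml  # imported lazily so validate_config works without PyYAML

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def validate_config(config):
    """Return a list of readable problems; an empty list means no problems."""
    problems = [f"missing key: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in config]
    stylesheets = config.get("stylesheets") or {}
    for kind in RESOURCE_TYPES:
        if not stylesheets.get(kind):
            problems.append(f"no stylesheet for: {kind}")
    for label, resources in (config.get("collection") or {}).items():
        for kind in RESOURCE_TYPES:
            if not (resources or {}).get(kind):
                problems.append(f"{label}: no path for {kind}")
    return problems


# Same shape as the layout above, with the speaker path left empty on purpose.
example = {
    "processor": "saxon/saxon.jar",
    "stylesheets": {"corpus": "path/to/corpus.xsl",
                    "event": "path/to/event.xsl",
                    "speaker": "path/to/speaker.xsl"},
    "output-inter": "path/to/intermediate/transformation/output",
    "output-final": "path/to/finalized/output",
    "collection": {"PF": {"corpus": "path/to/corpus/catalogue/file/corpus.xml",
                          "event": "path/to/events/",
                          "speaker": None}},
}
print(validate_config(example))  # ['PF: no path for speaker']
```

A check like this also catches a corpus label whose corpus, event and speaker entries were left empty, since an empty YAML value is loaded as None.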


dgd2cmdi's Issues

move to dev branch

avoid changes to master when editing documentation and tutorial notebooks

prevent redundant finalizing step

When building the final output, restrict this step to the corpora listed in the YAML configuration file.
Currently, the intermediate output folders are parsed and every resource found is finalized again.
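One way to approach this, sketched under the assumption that the intermediate output directory holds one subfolder per corpus label (the issue does not say how the folders are organized), is to filter the folders by the labels in the configuration's collection section. The function name `corpora_to_finalize` is hypothetical.

```python
import tempfile
from pathlib import Path


def corpora_to_finalize(intermediate_dir, collection):
    """Return intermediate corpus folders whose label is configured.

    Assumes one subfolder per corpus label, which the source does not confirm.
    """
    configured = set(collection)
    return sorted(
        entry for entry in Path(intermediate_dir).iterdir()
        if entry.is_dir() and entry.name in configured
    )


# Demonstration with a throwaway directory instead of real intermediate output.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("PF", "OLD"):  # "OLD" stands for a leftover from an earlier run
        (Path(tmp) / label).mkdir()
    selected = corpora_to_finalize(tmp, {"PF": {}})
    selected_names = [folder.name for folder in selected]

print(selected_names)  # ['PF']
```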

multi session processing of speaker information

When adding additional speaker information to a speaker element found in an event, the method speaker2event() iterates over all speaker elements found via xpath('//Speaker') in the current session of the event file and checks whether the label of each speaker element matches the selected speaker. If it does, the method adds the additional information to the speaker element as sub-elements.
After adding the information, speaker2event() records a triple consisting of the event, the speaker, and the label of the speaker element in the event. This triple is used to check whether the speaker has already been added to the event, which prevents multiple entries of the additional information.
However, an event may hold more than one recording session, so the same speaker can take part in two sessions. With the condition described above, the additional information is written to only one matching speaker element. A second session with the same speaker will then not get the additional information.

Possible solution:

Add an outer loop to iterate over Sessions with xpath('//Session') and set the stop-condition for sessions, not for events.

method is named speaker2event_session()
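The proposed outer loop can be sketched as follows. This is not the project's implementation: it uses the standard library's xml.etree.ElementTree instead of the xpath() calls mentioned above, and the `id` and `label` attributes, the `Info` element, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical event file: one event with two sessions sharing speaker SP1.
EVENT_XML = """
<Event id="E1">
  <Session id="S1"><Speaker label="SP1"/><Speaker label="SP2"/></Session>
  <Session id="S2"><Speaker label="SP1"/></Session>
</Event>
"""


def speaker2event_session(event, speaker_label, extra_info, seen):
    """Attach extra_info to the matching Speaker in every Session.

    `seen` holds (event, session, label) triples, so a speaker is
    enriched once per session instead of once per event.
    """
    for session in event.iter("Session"):
        key = (event.get("id"), session.get("id"), speaker_label)
        if key in seen:
            continue  # this session already has the information
        for speaker in session.iter("Speaker"):
            if speaker.get("label") == speaker_label:
                speaker.append(ET.fromstring(extra_info))
                seen.add(key)
                break


event = ET.fromstring(EVENT_XML.strip())
seen = set()
speaker2event_session(event, "SP1", "<Info>additional data</Info>", seen)
# A second call adds nothing, because both sessions are already recorded.
speaker2event_session(event, "SP1", "<Info>additional data</Info>", seen)

enriched = [s.get("id") for s in event.iter("Session")
            if s.find("Speaker/Info") is not None]
print(enriched)  # ['S1', 'S2']
```

Keying the stop condition on the session rather than the event is what lets the second session receive the information while still blocking duplicates on repeated calls.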

profile update

Implement a method to update the profile of a CMDI file after finalization.
