tei4htr / page2tei

A repository illustrating the transformation of a PAGE XML file into XML-TEI format, resulting from experiments carried out for the LECTAUREP project.

License: Creative Commons Attribution 4.0 International

XSLT 100.00%
tei pagexml xslt

page2tei's Introduction

LECTAUREP - Page2tei

This repository stores an XSLT for transforming a PAGE XML file into XML-TEI, created for the LECTAUREP project (INRIA - Archives nationales), along with the XML files resulting from the transformation. The XSLT was adapted from a first version created by Manon Ovide (inoblivionem).

For each annotation region (<TextRegion>) in a PAGE XML file, a <surface> element is created in the TEI file.

Repository tree

├── pagexml
├── tei
└── xmlpage_to_tei.xsl
  • xmlpage_to_tei.xsl: the XSLT.
  • pagexml: a directory storing the PAGE XML files.
  • tei: a directory storing the TEI files resulting from the transformation of the PAGE XML files.

Cite this work

Chagué, A., & Scheithauer, H. (2021). page2tei, an XSL Transformation to transform PAGE XML into TEI XML (Version 1.0.0) [Computer software]

page2tei's People

Contributors

alix-tz, hugoschtr

page2tei's Issues

Deleting <surfaceGrp> and only keeping <surface> as text regions?

According to the TEI guidelines, a <surfaceGrp> groups several written surfaces. We currently use both <surfaceGrp> and <surface> to represent a single text region, which results in a group containing only one <surface>:

<surfaceGrp xml:id="eSc_textblock_afbab800" type="structure_{type:col_1;}">
         <surface points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <path type="baseline" points="289,841 389,845"/>
               <line>198</line>
            </zone>
            <zone xml:id="eSc_line_4218ebcd"
                  type="mask"
                  points="278,981 285,940 311,929 380,948 384,992 359,1028 318,1028 282,1006">
               <path type="baseline" points="278,981 384,992"/>
               <line>199</line>
            </zone>
            ...
           </surface>
</surfaceGrp>
<surfaceGrp xml:id="eSc_textblock_c6e3bb97" type="structure_{type:col_3;}">
    <surface points="934,612 890,4216 772,4228 577,4207 615,615">
       <zone xml:id="eSc_line_c5f75194" type="mask" points="608,841 611,750 630,743 871,750 897,728 897,867 611,863">
          <path type="baseline" points="611,845 703,838 910,840"/>
          <line>Procuration</line>
       </zone>
          ...

A <surface> element alone, still grouping one or more <zone> elements representing baselines, may be more appropriate and less redundant for representing a text region.

See documentation: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-surface.html

<surface xml:id="eSc_textblock_afbab800"
               type="structure_{type:col_1;}"
               points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
         <zone xml:id="eSc_line_86b00a8e"
               type="mask"
               points="285,838 293,812 322,798 380,801 377,863 289,874">
            <path type="baseline" points="289,841 389,845"/>
            <line>198</line>
         </zone>
         ...
</surface>
<surface xml:id="eSc_textblock_c6e3bb97"
               type="structure_{type:col_3;}"
               points="934,612 890,4216 772,4228 577,4207 615,615">
         <zone xml:id="eSc_line_c5f75194"
               type="mask"
               points="608,841 611,750 630,743 871,750 897,728 897,867 611,863">
            <path type="baseline" points="611,845 703,838 910,840"/>
            <line>Procuration</line>
         </zone>
         ...
</surface>

Handling <TextRegion> without nested <Coords>

If no layout annotation exists, eScriptorium creates only one <TextRegion> element in the PAGE XML export, with an attribute id="eSc_dummyblock_". Baselines are then nested inside that element.

However, when this occurs, no <Coords> element is nested inside the <TextRegion>. The current XSLT reads the text region's coordinates from that element, so the resulting TEI file is invalid.

This exception must be handled in the XSLT.
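
One possible way to handle this case is sketched below in Python, for illustration only; it is not the repository's XSLT and not necessarily the fix that was adopted. The helper name `region_points` is hypothetical, and the namespace URI is assumed to be the PAGE 2019-07-15 schema, so check the `xmlns` of your own export. The idea is to look for a direct <Coords> child and to write a `points` attribute only when one exists.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace; check the xmlns declared in your own export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def region_points(region):
    """Return the region's own polygon, or None when <Coords> is missing."""
    # Direct child only: the <Coords> of nested <TextLine> elements are ignored.
    coords = region.find("pc:Coords", NS)
    if coords is None:
        return None
    return coords.get("points")


# Minimal export with a dummy block, modeled on the case described above.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="FRAN_0025_3056_L-0.jpg" imageWidth="2894" imageHeight="4393">
    <TextRegion id="eSc_dummyblock_">
      <TextLine id="eSc_line_86b00a8e">
        <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
        <Baseline points="289,841 389,845"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
for region in root.iterfind(".//pc:TextRegion", NS):
    points = region_points(region)
    surface = ET.Element("surface")
    if points is not None:
        # Only emit @points when the region really has a polygon.
        surface.set("points", points)
    print(region.get("id"), "->", ET.tostring(surface, encoding="unicode"))
```

In an XSLT, the same guard would amount to testing for the existence of the <Coords> child before creating the attribute.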

Which transformation for PAGE XML elements? xmlpage_to_tei.xsl v2 documentation

In the second version of the XSL, transformations (from PAGE XML to TEI) proceed as follows:

For metadata:

  <Metadata>
    <Creator>escriptorium</Creator>
    <Created>2021-10-07T07:46:39.064183+00:00</Created>
    <LastChange>2021-10-07T07:46:39.064229+00:00</LastChange>
  </Metadata>

becomes:

   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>FRAN_0025_3056_L-0</title>
            <respStmt>
               <resp>Transcribed with</resp>
               <name>escriptorium</name>
            </respStmt>
         </titleStmt>
         <publicationStmt>
            <p/>
         </publicationStmt>
         <sourceDesc>
            <p/>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change when="2021-10-07T07:46:39.064183+00:00">Creation</change>
         <change when="2021-10-07T07:46:39.064229+00:00">Last change</change>
      </revisionDesc>
   </teiHeader>
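
As an illustration of the date mapping above, here is a minimal Python sketch (not the repository's XSLT) that turns <Created> and <LastChange> into <change> elements. The function name `build_revision_desc` is hypothetical and the PAGE namespace URI is an assumption; the output element is built without the TEI namespace to keep the example short.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace; check the xmlns declared in your own export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

SAMPLE = """<Metadata xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Creator>escriptorium</Creator>
  <Created>2021-10-07T07:46:39.064183+00:00</Created>
  <LastChange>2021-10-07T07:46:39.064229+00:00</LastChange>
</Metadata>"""


def build_revision_desc(metadata):
    """Map PAGE <Created>/<LastChange> to <change when="..."> elements."""
    revision = ET.Element("revisionDesc")
    for tag, label in (("Created", "Creation"), ("LastChange", "Last change")):
        when = metadata.findtext("pc:" + tag, namespaces=NS)
        if when:  # skip dates that are absent from the export
            change = ET.SubElement(revision, "change", {"when": when})
            change.text = label
    return revision


revision = build_revision_desc(ET.fromstring(SAMPLE))
print(ET.tostring(revision, encoding="unicode"))
```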

For the transcription itself:

  <Page imageFilename="FRAN_0025_3056_L-0.jpg" imageWidth="2894" imageHeight="4393">
...

becomes:

<sourceDoc>
      <graphic url="FRAN_0025_3056_L-0.jpg" source="" width="2894px" height="4393px"/>
...

Every <TextRegion> and every baseline (masks and baselines):

    <TextRegion id="eSc_textblock_afbab800" custom="structure {type:col_1;}">
      <Coords points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615"/>
      <TextLine id="eSc_line_86b00a8e">
        <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
        <Baseline points="289,841 389,845"/>
        <TextEquiv>
          <Unicode>198</Unicode>
        </TextEquiv>
      </TextLine>
...

becomes:

<surfaceGrp xml:id="eSc_textblock_afbab800" type="structure_{type:col_1;}">
         <surface>
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <line type="baseline" points="289,841 389,845">198</line>
            </zone>
...
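
The region mapping above can be sketched in Python as follows. This is an illustration of the v2 structure shown in the example, not the repository's XSLT: the function name `region_to_tei` is hypothetical, the PAGE namespace URI is an assumption, and replacing the space in `custom` with an underscore is inferred from the sample output rather than from the stylesheet.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace; check the xmlns declared in your own export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id

SAMPLE = """<TextRegion xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    id="eSc_textblock_afbab800" custom="structure {type:col_1;}">
  <Coords points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615"/>
  <TextLine id="eSc_line_86b00a8e">
    <Coords points="285,838 293,812 322,798 380,801 377,863 289,874"/>
    <Baseline points="289,841 389,845"/>
    <TextEquiv>
      <Unicode>198</Unicode>
    </TextEquiv>
  </TextLine>
</TextRegion>"""


def region_to_tei(region):
    """Build surfaceGrp > surface > zone > line from one PAGE <TextRegion>."""
    grp = ET.Element("surfaceGrp", {XML_ID: region.get("id")})
    custom = region.get("custom")
    if custom:
        # Inferred from the sample: "structure {..}" becomes "structure_{..}".
        grp.set("type", custom.replace(" ", "_"))
    surface = ET.SubElement(grp, "surface")
    for text_line in region.iterfind("pc:TextLine", NS):
        # The line's polygon becomes the mask zone.
        zone = ET.SubElement(surface, "zone", {XML_ID: text_line.get("id"), "type": "mask"})
        zone.set("points", text_line.find("pc:Coords", NS).get("points"))
        # The baseline and the transcription go into <line>.
        line = ET.SubElement(zone, "line", {"type": "baseline"})
        line.set("points", text_line.find("pc:Baseline", NS).get("points"))
        line.text = text_line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
    return grp


grp = region_to_tei(ET.fromstring(SAMPLE))
print(ET.tostring(grp, encoding="unicode"))
```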

Encoding baselines' coordinates with TEI: which attribute?

We're currently encoding a baseline from a PAGE XML file as follows:

<zone xml:id="eSc_line_26e1eb01"
                  type="mask"
                  points="1256,1599 1260,1548 1340,1548 1384,1566 1447,1548 1523,1548 1545,1566 1571,1548 1615,1548 1633,1566 1758,1555 1923,1566 1952,1552 1974,1566 2447,1563 2458,1603 2439,1625 2388,1629 1787,1629 1773,1614 1641,1629 1626,1614 1443,1614 1428,1629 1340,1614 1260,1629">
               <line type="baseline" points="1260,1603 1981,1599 2029,1607 2460,1605">c/ Edouard Eugène Lebourcq &amp; Eugène Pauline Potel 104 r. St Maur</line>
            </zone>

The <zone> element corresponds to the baseline's mask, and the <line> element to the baseline itself, with its text node and its coordinates.

However, some baselines are much simpler and have only two x,y pairs, for example:

            <zone xml:id="eSc_line_41fc839d"
                  type="mask"
                  points="293,1511 304,1478 351,1467 377,1478 384,1515 373,1548 296,1559">
               <line type="baseline" points="296,1515 385,1518">203</line>
            </zone>

The encoding then becomes invalid TEI, because the points attribute requires at least three x,y pairs. See documentation: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.coordinated.html

How can we get around this problem?
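
To see which baselines are affected, a small check like the following Python sketch can count the x,y pairs in a `points` value. It only detects the problem and does not settle which encoding to use; the function names are hypothetical, and the threshold of three pairs is the constraint quoted above.

```python
MIN_TEI_POINTS = 3  # minimum number of x,y pairs stated above for @points


def count_points(points):
    """Count the whitespace-separated x,y pairs in a points value."""
    return len(points.split())


def is_valid_tei_points(points):
    """True when the value has enough pairs for the TEI @points attribute."""
    return count_points(points) >= MIN_TEI_POINTS


# The two baselines quoted in this issue.
long_baseline = "1260,1603 1981,1599 2029,1607 2460,1605"
short_baseline = "296,1515 385,1518"

for value in (long_baseline, short_baseline):
    print(count_points(value), is_valid_tei_points(value))
```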

Summarize the data model in the README

You should add a section in the README to document the behavior of the XSL and the choices made to build the structure. For more details about crucial choices that were made, you could point to the most relevant issues.

Tagging image metadata inside a facsimile element

Image metadata is currently tagged within the <sourceDoc> element with <graphic>.

<sourceDoc>
      <graphic url="FRAN_0025_3056_L-0.jpg" width="2894px" height="4393px"/>
      <surfaceGrp>
         <surface xml:id="eSc_textblock_afbab800"
                  type="structure_{type:col_1;}"
                  points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <path type="baseline" points="289,841 389,845"/>
               <line>198</line>
            </zone>
            ...

Instead, and for the sake of clarity, image metadata can be tagged inside the <facsimile> element:

<facsimile>
      <graphic url="FRAN_0025_3056_L-0.jpg" width="2894px" height="4393px" xml:id="FRAN_0025_3056_L-0"/>
</facsimile>
<sourceDoc>
      <surfaceGrp facs="#FRAN_0025_3056_L-0">
         <surface xml:id="eSc_textblock_afbab800"
                  type="structure_{type:col_1;}"
                  points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
            <zone xml:id="eSc_line_86b00a8e"
                  type="mask"
                  points="285,838 293,812 322,798 380,801 377,863 289,874">
               <path type="baseline" points="289,841 389,845"/>
               <line>198</line>
            </zone>
            ...

Image metadata and transcription data would then be separated in their respective elements. With appropriate xml:id and facs attributes, multiple images could be encoded with a single TEI file.

Sync with Zenodo

  • add DOI badge
  • add DOI in citation
  • add license
  • add Zenodo metadata

Display the transcription in the body

It could be useful to also add content to the body of the TEI XML file, even if it is very basic.
That way, when the file is imported into the TEI4HTR application from TEI Publisher, the 'transcription' will not be empty. This helps users with their transcription, since part of it will already be there, even if it is not yet correctly encoded.
With a few extra lines in the XSLT, the body could be filled in.
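
A minimal sketch of this idea in Python, for illustration only: it collects the text of every <line> in the <sourceDoc> and writes it into a <body>. The encoding chosen here (one <ab> per <surface>, one <lb/> per line pointing back to its zone with `facs`) is an assumption, not a decision made in this repository, and the function name `build_body` is hypothetical. The sample omits the TEI namespace to keep the example short.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree stores xml:id

SAMPLE = """<sourceDoc>
  <surface xml:id="eSc_textblock_afbab800">
    <zone xml:id="eSc_line_86b00a8e" type="mask">
      <line>198</line>
    </zone>
    <zone xml:id="eSc_line_4218ebcd" type="mask">
      <line>199</line>
    </zone>
  </surface>
</sourceDoc>"""


def build_body(source_doc):
    """Copy the line texts of a <sourceDoc> into a basic <body>."""
    body = ET.Element("body")
    for surface in source_doc.iter("surface"):
        block = ET.SubElement(body, "ab")  # one block per text region (assumption)
        for zone in surface.iterfind("zone"):
            # One line break per zone, linked to the zone it comes from.
            lb = ET.SubElement(block, "lb", {"facs": "#" + zone.get(XML_ID)})
            lb.tail = zone.findtext("line", default="")
    return body


body = build_body(ET.fromstring(SAMPLE))
print(ET.tostring(body, encoding="unicode"))
```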

Remove "structure_{type:...;}"

Shouldn't we simplify the value of @type on //surface (which corresponds to the region types coming from eScriptorium)?

Change from:

<surface xml:id="eSc_textblock_afbab800"
                  type="structure_{type:col_1;}"
                  points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">

To:

<surface xml:id="eSc_textblock_afbab800"
                  type="col_1"
                  points="421,615 421,2236 465,2211 465,2266 421,2269 425,2449 410,4148 362,4213 205,4228 234,615">
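
The simplification proposed here can be sketched with a regular expression, shown in Python for illustration rather than as the XSLT change itself. The function name `simplify_region_type` is hypothetical, and leaving the value unchanged when no `type:` key is found is an assumption.

```python
import re

# Capture whatever follows "type:" up to the next ";" or "}".
TYPE_RE = re.compile(r"type:([^;}]+)")


def simplify_region_type(value):
    """Reduce 'structure_{type:col_1;}' to 'col_1'; leave other values as-is."""
    match = TYPE_RE.search(value)
    if match is None:
        return value
    return match.group(1).strip()


print(simplify_region_type("structure_{type:col_1;}"))
print(simplify_region_type("structure {type:col_3;}"))
```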
