
schema's Introduction

This repository contains ALTO schema versions - drafts and final released ones.

All open issues and discussions about changes to the ALTO standard can be found and tracked in the issues repository.

Latest official schema version is 4.4.
Primary source for the schema is (http://www.loc.gov/standards/alto/v4/alto-4-4.xsd)
Alternate source for the schema is (https://cdn.rawgit.com/altoxml/schema/master/v4/alto-4-4.xsd)

Summary of proposed changes

  • Change schema version to 4.4
  • Add LANG attribute on PageType level to describe the default language used in document
  • Add ROTATION attribute on PageType level to describe the default rotation used in document
  • Add OTHERLANGS attribute on PageType to summarize all the languages present in a particular document
  • Adapt "PointsType" documentation
  • Adapt xLink attribute group documentation on "BlockType"
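
As a minimal sketch of how a consumer might read the new page-level defaults, the Python snippet below pulls LANG, OTHERLANGS and ROTATION from a Page element. The sample document, the function name `page_defaults`, and the treatment of OTHERLANGS as a whitespace-separated list are assumptions made for illustration; check the 4.4 schema documentation for the authoritative value formats.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Hypothetical, heavily trimmed sample (not a complete, schema-valid ALTO file).
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" LANG="fr" OTHERLANGS="la de" ROTATION="0.0"/>
  </Layout>
</alto>"""


def page_defaults(alto_xml):
    """Return the page-level language and rotation defaults as a dict."""
    root = ET.fromstring(alto_xml)
    page = root.find(f".//{{{ALTO_NS}}}Page")
    return {
        "lang": page.get("LANG"),
        # Assumption: OTHERLANGS is a whitespace-separated list of language codes.
        "other_langs": page.get("OTHERLANGS", "").split(),
        "rotation": float(page.get("ROTATION", "0")),
    }


print(page_defaults(SAMPLE))
```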

Details about the changes of the version and further documentation can be found in the ALTO documentation repository.

schema's People

Contributors

artunit, cipriandinu, cneud, cowboymontana, evelienket, jpmoreux, jukervin, markusenders, mittagessen, stweil


schema's Issues

Provenance for OCRProcessing/Processing and Content

The current OCRProcessing statement is rather rudimentary: it does not allow an identifier for each ProcessingStep, so features in the recognition results cannot be linked to particular steps. For example, in our pipeline we frequently use tesseract's page segmentation with ocropus's recognition, so TextLine elements are sourced from one ProcessingStep and their text content from another.

A particular use case arises when postprocessing steps such as spell checkers add further variants to String tags (something we'd also like to see): it may be unclear whether a variant was produced by the recognition engine itself or by the spell checker.

Capturing complex workflow provenance

@acpopat wrote:

Some processing histories may not be simple sequential pipelines and may require a more general graph structure. As mentioned in today's board call, some OCR post-correction schemes provide examples of such processing:

  • merging results from multiple OCR engines
  • post-correction using multiple information sources
  • coalescing information from multiple page images and their OCR results

If it is desired that the results of such processing be represented in ALTO, then a more general provenance scheme capable of representing graph-structured dependencies might be required, such as that referred to by Clemens in his Aug 2 comment.

"Production family" attributes: CS, ILLS, DBTS

Use cases:

These String-related attributes can be used to describe human decisions and actions during the OCR text correction process:
ILLS (boolean, optional): specifies whether a word is illegible in the source document (and consequently can't be corrected). This status can be used:
- during the production workflow (the quality control process needs to know whether a specific word falls within the scope of the guaranteed text quality; in addition, this status indicates that the provider has performed a manual task on the word)
- by the viewing software: end users should be informed that some words are illegible in the source document itself (it's not an OCR error).

DBTS (boolean, optional): specifies that a word has been corrected but a doubt remains. Same use cases.
• These two attributes are part of the "OCR correction family" of attributes, along with CS (Correction Status), which is already defined by the schema.

Remarks:

  • ILLS could be useful on the TextBlock/TextLine types too (otherwise, the ILLS attribute must be set to TRUE on every String in the block):
    • areas of the page with physical defects: stains, blur, etc.
    • areas of the page with scanning defects: curvature near the binding, missing blocks near the margins, etc.

To be consistent, these attributes could be defined at the TextBlock, TextLine and String levels, with a recommendation: always set the attribute at the highest level available (i.e., do not set an attribute on all the sub-elements of a specific level).
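
The "highest level available" recommendation implies that a reader must treat the flag as inherited. A minimal Python sketch of that lookup follows; the sample fragment and the function name `flagged_strings` are hypothetical, and ILLS/DBTS on TextBlock and TextLine are the proposal discussed here, not attributes of a released schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: ILLS is set once on the block instead of on every String.
SAMPLE = """<TextBlock ID="TB1" ILLS="true">
  <TextLine ID="TL1">
    <String ID="ST1" CONTENT="AnfûràoII"/>
    <String ID="ST2" CONTENT="droits" DBTS="true"/>
  </TextLine>
</TextBlock>"""

TRUE_VALUES = ("true", "1")  # the two lexical forms of a true xsd:boolean


def flagged_strings(block, attr):
    """Return IDs of String elements for which attr holds, directly or by inheritance."""
    result = []
    block_flag = block.get(attr) in TRUE_VALUES
    for line in block.iter("TextLine"):
        line_flag = block_flag or line.get(attr) in TRUE_VALUES
        for string in line.iter("String"):
            if line_flag or string.get(attr) in TRUE_VALUES:
                result.append(string.get("ID"))
    return result


block = ET.fromstring(SAMPLE)
print(flagged_strings(block, "ILLS"))
print(flagged_strings(block, "DBTS"))
```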

Examples

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" ILLS="true" CONTENT="AnfûràoII"/>

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" DBTS="true" CONTENT="droits"/>

Schema change:
<xsd:attribute name="ILLS" type="xsd:boolean" use="optional">
  <xsd:annotation>
    <xsd:documentation>The word is illegible in the source document and can't be manually corrected. If the content owner thinks the word is legible, the attribute must be dropped (ILLS="false" is not recommended).</xsd:documentation>
  </xsd:annotation>
</xsd:attribute>

<xsd:attribute name="DBTS" type="xsd:boolean" use="optional">
  <xsd:annotation>
    <xsd:documentation>The word has been manually corrected but a doubt remains. If the content owner thinks the doubt is not legitimate, the attribute must be dropped (DBTS="false" is not recommended).</xsd:documentation>
  </xsd:annotation>
</xsd:attribute>

CircleType - annotation change and type definition

Proposal to change the annotation / documentation so that it clearly defines the point described by HPOS / VPOS.
The attribute types also need to be defined.

Current schema

<xsd:complexType name="CircleType">
  <xsd:annotation>
    <xsd:documentation>A circle shape.</xsd:documentation>
  </xsd:annotation>
  <xsd:attribute name="HPOS"/>
  <xsd:attribute name="VPOS"/>
  <xsd:attribute name="RADIUS"/>
</xsd:complexType>

Proposed schema

<xsd:complexType name="CircleType">
  <xsd:annotation>
    <xsd:documentation>A circle shape. HPOS and VPOS describe
       the center of the circle.</xsd:documentation>
  </xsd:annotation>
  <xsd:attribute name="HPOS" type="xsd:float" use="required"/>
  <xsd:attribute name="VPOS" type="xsd:float" use="required"/>
  <xsd:attribute name="RADIUS" type="xsd:float" use="required"/>
</xsd:complexType>
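
To illustrate why the meaning of HPOS / VPOS matters, the Python sketch below derives a bounding box and a hit test from a circle, assuming (as the proposed annotation states) that HPOS and VPOS give the center. The `Circle` class and its method names are illustrative only and not part of ALTO.

```python
from dataclasses import dataclass


@dataclass
class Circle:
    hpos: float    # horizontal position of the center
    vpos: float    # vertical position of the center
    radius: float

    def bounding_box(self):
        """Return (HPOS, VPOS, WIDTH, HEIGHT) of the enclosing rectangle."""
        return (self.hpos - self.radius, self.vpos - self.radius,
                2 * self.radius, 2 * self.radius)

    def contains(self, x, y):
        """True if the point (x, y) lies inside or on the circle."""
        return (x - self.hpos) ** 2 + (y - self.vpos) ** 2 <= self.radius ** 2


circle = Circle(hpos=100.0, vpos=50.0, radius=10.0)
print(circle.bounding_box())
```

If HPOS / VPOS were instead read as the top-left corner of the enclosing box, every derived coordinate would shift by one radius, which is the ambiguity the annotation change removes.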

Referencing upper level document within ALTO files

With the current schema, the related document can't be referenced in the ALTO files.
This information could help to reference:

  • the ID of the source document in the library catalog,
  • the ID of the digital document in the digital library,
  • a production ID during the digitization process, etc.

Use case:
Typical use cases are:

  1. The ability to access, from within the ALTO files:
    • the document manifest (METS or other formats), if it is named with the same ID, in order to retrieve document-level information
    • related files of the digital document, if they are named with the same ID

  2. The ability to check that each ALTO file in a document package relates to the right document (identified by its ID). This automatic check catches file delivery errors made by service providers.

  3. When ALTO files are exchanged between people at the file level, the information about the related document is stored in the ALTO file itself.

Implementation 1:
New attribute on "alto" element (xsd:ID, optional):

<alto xmlns:xsi=... xmlns:xlink=... xmlns=...
documentIdentifier="7504400"> 

Implementation 2:
New "documentIdentifier" element on "sourceImageInformation" element:

<sourceImageInformation>
      <fileName>00000001.tif</fileName>
      <!-- Production ID-->
      <documentIdentifier IdentifierLocation="info:bnf/spar/reference#productionIdentifier">7504397</documentIdentifier> 
</sourceImageInformation>

Implementation 3:
No schema change needed, but this misuses fileIdentifier:

<sourceImageInformation>
      <fileName>00000001.tif</fileName>
      <fileIdentifier fileIdentifierLocation="info:bnf/spar/reference#productionIdentifier">7504397</fileIdentifier> 
      <!-- ID of the catalog entry -->
      <fileIdentifier fileIdentifierLocation="http://purl.org/dc/elements/1.1/relation">ark:/12148/cb328051026</fileIdentifier> 
</sourceImageInformation>
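
The delivery check from use case 2 could be sketched as follows in Python, here against the fileIdentifier variant of Implementation 3. The function name `belongs_to_document` and the trimmed, namespace-free sample are assumptions for illustration; a real ALTO file would need its namespace handled.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free fragment modelled on Implementation 3.
SAMPLE = """<sourceImageInformation>
  <fileName>00000001.tif</fileName>
  <fileIdentifier fileIdentifierLocation="info:bnf/spar/reference#productionIdentifier">7504397</fileIdentifier>
</sourceImageInformation>"""


def belongs_to_document(source_info_xml, expected_id):
    """Return True if any fileIdentifier in the fragment equals expected_id."""
    info = ET.fromstring(source_info_xml)
    found = [e.text.strip() for e in info.iter("fileIdentifier") if e.text]
    return expected_id in found


print(belongs_to_document(SAMPLE, "7504397"))
```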

Schema change:

Processing history

Recently, several feature requests were submitted that relate to the recording of processing information in ALTO (see #13, #27, #36, #35 for in-depth information). In an attempt to consolidate and harmonize the requests, this issue shall serve as the main point of discussion from now on.

Features requested:

  • Change OCRProcessing to generic Processing (#13, #35).
  • Change preProcessingStep, ocrProcessingStep, postProcessingStep to generic processingStep with processingStepType element to record the type of processing performed (#13).
  • Add required attribute ID to ProcessingStepType (#13, #27, #35).
  • Add optional attributes COR (CORRECTEDBY), VER (VERIFIEDBY) for all elements. The attributes hold a list of references (using the ID attribute) to all processingStepType entries which have changed the original value (#27).
  • Being able to link elements to a particular processingStep (#35).
    Example: Use Tesseract's page segmentation with Ocropus's recognition, so that TextLine elements are sourced from one ProcessingStep (Tesseract), but their text content from another one (Ocropus).
  • Common vocabulary of processingStepType attribute values to increase interoperability (#36).
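
A rough Python sketch of what ID-based linking would enable: checking that every COR / VER reference resolves to a declared processing step. The element layout in the sample, the step IDs and the function name `dangling_step_refs` are hypothetical, since the requests above had not been turned into a schema at the time of writing.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout combining the requested features; not a released ALTO structure.
SAMPLE = """<alto>
  <Processing>
    <processingStep ID="SEG"><processingStepType>segmentation</processingStepType></processingStep>
    <processingStep ID="REC"><processingStepType>recognition</processingStepType></processingStep>
  </Processing>
  <TextLine ID="TL1" COR="SEG">
    <String ID="ST1" CONTENT="droits" COR="REC" VER="PROOF"/>
  </TextLine>
</alto>"""


def dangling_step_refs(root):
    """Return (element ID, attribute, reference) for references to unknown steps."""
    step_ids = {step.get("ID") for step in root.iter("processingStep")}
    problems = []
    for elem in root.iter():
        for attr in ("COR", "VER"):
            for ref in elem.get(attr, "").split():
                if ref not in step_ids:
                    problems.append((elem.get("ID"), attr, ref))
    return problems


print(dangling_step_refs(ET.fromstring(SAMPLE)))
```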

OCR correction attributes: CS, ILLS, DBTS

Use cases:

These String-related attributes can be used to describe human decisions and actions during the OCR text correction process:
ILLS (boolean, optional): specifies whether a word is illegible in the source document (and consequently can't be corrected). This status can be used:
- during the production workflow (the quality control process needs to know whether a specific word falls within the scope of the guaranteed text quality; in addition, this status indicates that the provider has performed a manual task on the word)
- by the viewing software: end users should be informed that some words are illegible in the source document itself (it's not an OCR error).

DBTS (boolean, optional): specifies that a word has been corrected but a doubt remains. Same use cases.
• These two attributes are part of the "production family" of attributes, along with CS (Correction Status), which is already defined by the schema.

Remarks: ILLS could be useful on the TextBlock/TextLine types too:

  • areas of the page with physical defects: stains, blur, etc.
  • areas of the page with scanning defects: curvature near the binding, missing blocks near the margins, etc.

These attributes must be defined with a recommendation: always set the attribute at the highest level possible (i.e., do not set an attribute on all the sub-elements).

Examples:

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" ILLS="true" CONTENT="AnfûràoII"/>

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" DBTS="true" CONTENT="droits"/> 

Schema change:

<xsd:attribute name="ILLS" type="xsd:boolean" use="optional">
  <xsd:annotation>
    <xsd:documentation>The word is illegible in the source document and can't be manually corrected. If the content owner thinks the word is legible, the attribute must be dropped (ILLS="false" is not recommended).</xsd:documentation>
  </xsd:annotation>
</xsd:attribute>
<xsd:attribute name="DBTS" type="xsd:boolean" use="optional">
  <xsd:annotation>
    <xsd:documentation>The word has been manually corrected but a doubt remains. If the content owner thinks the doubt is not legitimate, the attribute must be dropped (DBTS="false" is not recommended).</xsd:documentation>
  </xsd:annotation>
</xsd:attribute>

Tags/Tag

A reference to a new element group "Tags/Tag" makes it possible to attach additional information to the elements that refer to these tag elements.
This idea arose at the face-to-face meeting as a way to cover several existing change requests.

<Tags>
  <Tag ID="Tag01" TYPE="STRUCTURE" LABEL="Title" DESCRIPTION="Title"/>
  <Tag ID="Tag02" TYPE="STRUCTURE" LABEL="RunningTitle" DESCRIPTION="Repeating text on each page"/>
  <Tag ID="Tag03" TYPE="STRUCTURE" LABEL="TOC" DESCRIPTION="Table of content"/>
  <Tag ID="Tag04" TYPE="STRUCTURE" LABEL="PageNumberReference" DESCRIPTION="Reference to page number of other page"/>
  <Tag ID="Tag05" TYPE="STRUCTURE" LABEL="Footnote" DESCRIPTION="Footnote text"/>
  <Tag ID="Tag06" TYPE="STRUCTURE" LABEL="FootnoteReference" DESCRIPTION="Reference to footnote"/>
  <Tag ID="Tag07" TYPE="STRUCTURE" LABEL="Headline" DESCRIPTION="Headline of article" />
  <Tag ID="Tag08" TYPE="STRUCTURE" LABEL="Subheadline" DESCRIPTION="Subheadline of article" />
  <Tag ID="Tag09" TYPE="LAYOUT" LABEL="MusicalScore" DESCRIPTION="Musical notation"/>
  <Tag ID="Tag10" TYPE="LAYOUT" LABEL="MathFormula" DESCRIPTION="Mathematical formula"/>
  <Tag ID="Tag11" TYPE="LAYOUT" LABEL="ChemFormula" DESCRIPTION="Chemical formula"/>
  <Tag ID="Tag12" TYPE="NAMED-ENTITY" SUBTYPE="Person" LABEL="Zachariah Jackson">
    <mads:mads version="2.0" xmlns:mads="http://www.loc.gov/mads/v2" xsi:schemaLocation="http://www.loc.gov/mads/v2 http://www.loc.gov/standards/mads/mads.xsd">
        <mads:authority geographicSubdivision="not applicable">
            <mads:name type="personal" authority="naf">
                <mads:namePart>Jackson, Zachariah</mads:namePart>
            </mads:name>
        </mads:authority>
        <mads:variant type="other">
            <mads:name type="personal">
                <mads:namePart>Jackson, Z. (Zachariah)</mads:namePart>
            </mads:name>
        </mads:variant>
        <mads:note type="nonpublic">Appears as printer or bookseller in Dublin imprints 1789-1799</mads:note>
        <mads:note type="source">Jackson, Z. Shakspeare's genius justified, 1819: t.p. (Z. Jackson)</mads:note>
        <mads:note type="source">OCLC, Oct. 23, 1997 (hdg.: Jackson, Zachariah; usage: Zachariah Jackson; Z. Jackson)</mads:note>
        <mads:note type="notFound">Munter, R. Print trade in Ireland 1550-1775</mads:note>
        <mads:identifier type="lccn">no 97063153 </mads:identifier>
        <mads:recordInfo>
            <mads:recordOrigin>Converted from MARCXML to MADS version 2.0 (Revision 2.10)</mads:recordOrigin>
            <mads:recordContentSource authority="marcorg">IEN</mads:recordContentSource>
            <mads:recordChangeDate encoding="iso8601">20090314071654.0</mads:recordChangeDate>
            <mads:recordIdentifier source="DLC">no97063153</mads:recordIdentifier>
            <mads:languageOfCataloging>
                <mads:languageTerm authority="iso639-2b" type="code">eng</mads:languageTerm>
            </mads:languageOfCataloging>
            <mads:descriptionStandard>aacr2</mads:descriptionStandard>
        </mads:recordInfo>
    </mads:mads>
    </Tag>
  <Tag ID="Tag13" TYPE="NAMED-ENTITY" SUB_TYPE="person" LABEL="John Johnson" />
  <Tag ID="Tag14" TYPE="NAMED-ENTITY" SUB_TYPE="location" LABEL="London">
    <mads:mads version="2.0" xmlns:mads="http://www.loc.gov/mads/v2" xsi:schemaLocation="http://www.loc.gov/mads/v2 http://www.loc.gov/standards/mads/mads.xsd">
        <mads:authority geographicSubdivision="not applicable">
            <mads:name type="personal" authority="naf">
                <mads:namePart>Jackson, Zachariah</mads:namePart>
            </mads:name>
        </mads:authority>
    </mads:mads>
  </Tag>
  <Tag ID="Tag15" TYPE="NAMED-ENTITY" SUB_TYPE="person" LABEL="Johann Christian Bach"/>
</Tags>
<TextBlock ID="P7_TB00003" HPOS="76" VPOS="358" WIDTH="870" HEIGHT="32" STYLEREFS="TXT_0 PAR_CENTER" TAGREFS="Tag01" >
     <TextLine ID="P7_TL00003" HPOS="78" VPOS="358" WIDTH="868" HEIGHT="32">
     <String ID="P7_ST00005" HPOS="78" VPOS="358" WIDTH="356" HEIGHT="32" CONTENT="Bach" WC="0.70" CC="5063" TAGREFS="Tag15"/>
     <SP ID="P7_SP00003" HPOS="434" VPOS="390" WIDTH="24"/>
     <String ID="P7_ST00006" HPOS="458" VPOS="360" WIDTH="95" HEIGHT="30" CONTENT="J." WC="0.74" CC="50" TAGREFS="Tag15"/>
     <SP ID="P7_SP00004" HPOS="553" VPOS="390" WIDTH="22"/>
     <String ID="P7_ST00007" HPOS="575" VPOS="359" WIDTH="371" HEIGHT="31" CONTENT="ILLUSTRATIONS" WC="0.88" CC="0000400500104"/>
     </TextLine>
</TextBlock>
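
How a consumer might resolve TAGREFS back to tag labels can be sketched in Python as below. The sample is a trimmed, namespace-free version of the example above, and the function name `labels_by_element` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example above, wrapped in a single root for parsing.
SAMPLE = """<alto>
  <Tags>
    <Tag ID="Tag01" TYPE="STRUCTURE" LABEL="Title"/>
    <Tag ID="Tag15" TYPE="NAMED-ENTITY" LABEL="Johann Christian Bach"/>
  </Tags>
  <TextBlock ID="P7_TB00003" TAGREFS="Tag01">
    <TextLine ID="P7_TL00003">
      <String ID="P7_ST00005" CONTENT="Bach" TAGREFS="Tag15"/>
      <String ID="P7_ST00007" CONTENT="ILLUSTRATIONS"/>
    </TextLine>
  </TextBlock>
</alto>"""


def labels_by_element(root):
    """Map each element ID that carries TAGREFS to the labels of the referenced tags."""
    tags = {tag.get("ID"): tag.get("LABEL") for tag in root.iter("Tag")}
    resolved = {}
    for elem in root.iter():
        refs = elem.get("TAGREFS")
        if refs:
            # A reference to an undeclared tag raises KeyError here.
            resolved[elem.get("ID")] = [tags[ref] for ref in refs.split()]
    return resolved


print(labels_by_element(ET.fromstring(SAMPLE)))
```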

Reading Order (IMPACT)

use case
Modern OCR software is able to recognize sections within a page. The logical text flow between some of the sections may be continuous. However, the OCR software may not always be able to recognize the text flow correctly, and may store these sections in non-contiguous parts of the ALTO file.

Additional processing software or manual intervention may correct this problem. ALTO has to store the reading order explicitly. The reading order should not rely on the order of XML elements in the ALTO file.

implementation
A single <ReadingOrder> element defines the information flow for every section in the document. Such a section is called a Region. Each region is specified by a <RegionRef> element. This element points to one Block element (see chapter 4). Each region is part of a group. A group can contain regions that are

• unordered (information flow doesn't have a particular order) or
• ordered (information flow has a particular order).
The appropriate elements are called <UnorderedGroup> or <OrderedGroup>. Every region must be part of exactly one group. All regions in an ordered group must provide their position within the group. This position is stored in the ORDER attribute. The value of the ORDER attribute must be an integer and must be unique within the group.

In order to represent complex information flows within a page, groups may have an unlimited number of sub-groups. Sub-groups are themselves ordered or unordered groups, and both types of groups may contain either type of sub-group.

example

<ReadingOrder>
  <OrderedGroup ID="G1">
    <RegionRef IDREF="xxxx001" ORDER="1"/>
    <RegionRef IDREF="…" ORDER="2"/>
    <RegionRef IDREF="……" ORDER="3"/>
  </OrderedGroup>
  <UnorderedGroup ID="UG1">
    <RegionRef IDREF="…"/>
    <RegionRef IDREF="……"/>
  </UnorderedGroup>
</ReadingOrder>
<Layout>
  <Page>
    <PrintSpace>
      <TextBlock ID="xxxx001">
        <TextLine>
          <String CONTENT="Advertisement"/>
        </TextLine>
      </TextBlock>
    </PrintSpace>
  </Page>
  <!-- ..... the complete layout description -->
</Layout>
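
The rules above (ORDER is an integer, unique within an ordered group, and independent of XML element order) can be sketched in Python as follows. The block IDs and the function names `flatten` and `reading_order` are hypothetical, and nested sub-groups are deliberately left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical IDs; the RegionRef elements are deliberately out of order in the file.
SAMPLE = """<ReadingOrder>
  <OrderedGroup ID="G1">
    <RegionRef IDREF="B2" ORDER="2"/>
    <RegionRef IDREF="B3" ORDER="3"/>
    <RegionRef IDREF="B1" ORDER="1"/>
  </OrderedGroup>
  <UnorderedGroup ID="UG1">
    <RegionRef IDREF="B5"/>
    <RegionRef IDREF="B4"/>
  </UnorderedGroup>
</ReadingOrder>"""


def flatten(group):
    """Return the IDREFs of a group's direct RegionRef children in reading order."""
    refs = [child for child in group if child.tag == "RegionRef"]
    if group.tag == "OrderedGroup":
        orders = [int(ref.get("ORDER")) for ref in refs]
        if len(set(orders)) != len(orders):
            raise ValueError(f"duplicate ORDER value in group {group.get('ID')}")
        # Sort by ORDER rather than by position in the file.
        refs = [ref for _, ref in sorted(zip(orders, refs), key=lambda pair: pair[0])]
    return [ref.get("IDREF") for ref in refs]


def reading_order(root):
    """Map each top-level group ID to its list of referenced block IDs."""
    return {group.get("ID"): flatten(group)
            for group in root
            if group.tag in ("OrderedGroup", "UnorderedGroup")}


print(reading_order(ET.fromstring(SAMPLE)))
```

For the unordered group the file order is simply kept, since no order is implied.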

EllipseType - annotation change and type definition

Proposal for EllipseType annotation change, to clarify point description.

Current schema

<xsd:complexType name="EllipseType">
  <xsd:annotation>
    <xsd:documentation>An ellipse shape.</xsd:documentation>
  </xsd:annotation>
  <xsd:attribute name="HPOS"/>
  <xsd:attribute name="VPOS"/>
  <xsd:attribute name="HLENGTH"/>
  <xsd:attribute name="VLENGTH"/>
</xsd:complexType>

Proposed schema

<xsd:complexType name="EllipseType">
  <xsd:annotation>
    <xsd:documentation>An ellipse shape. The point described is the center of the shape.
         HLENGTH and VLENGTH are the width and height of the described ellipse.
    </xsd:documentation>
  </xsd:annotation>
  <xsd:attribute name="HPOS" type="xsd:float" use="required"/>
  <xsd:attribute name="VPOS" type="xsd:float" use="required"/>
  <xsd:attribute name="HLENGTH" type="xsd:float" use="required"/>
  <xsd:attribute name="VLENGTH" type="xsd:float" use="required"/>
</xsd:complexType>

ALTO support for OCR of video

ALTO should support OCR of video efficiently.

This is a future-looking issue, not something we're likely to address immediately, but something to keep in mind as we drive progress of ALTO to be a suitable representation of all OCR output or ground truth in general, whether the source be scanned documents, scene text, screenshots, or video.

Video may require special consideration because the straightforward approach, having an ALTO record for the result of OCR of each frame, would be grossly inefficient: in most videos, text is present in only some portions, and it tends to persist over segments, either entirely or partially.

To track and drive this capability, this issue proposes that ALTO should represent the "ideal" OCR results of the attached video in much the way a human commentator would: by describing the overall text once, and representing dynamically the changing text-region boundaries in the moving scene in an efficient manner, e.g., by encoding differences in bounding boxes or by describing the motion parametrically.

Considering video may also drive discussion of the relative roles of layout representation and text-fragment representation, and of collection-level annotation (e.g., book or video or newspaper) and page-level annotation.

Relevant files:

This issue will be considered fixed when the following has happened:

For the two referenced video files, represent the ideal OCR results (i.e., OCR ground truth) efficiently using ALTO and attach the XML files to this issue.

Non Linear Hyphens

Describing a hyphenated word that runs across two pages, or between the main text flow and a footnote block, is non-deterministic.

Example:
left page: one hyphen in the last footnote: "Victor-"
right page: one hyphen in the main text flow ("Fon-") and the second part of the page 194 hyphen ("Hugo")

In this example, the ALTO markup could lead one to think that the String "Fon-" is the first part of the hyphenated word (HypPart1) and the String "Hugo" the second part (HypPart2). In such a case, a tool that validates hyphen consistency will fail to do its job.

These ALTO files were produced during an EPUB+ALTO digitization program. The EPUB format needs to identify footnotes; consequently, the export of hyphens in the ALTO files is logically correct but "unclear" in the ALTO context.

...
<String ID="PAG_00000213_ST000193" CONTENT="Fon-" HEIGHT="44" HPOS="1335" STYLEREFS="TXT_14" SUBS_CONTENT="Fontanes" SUBS_TYPE="HypPart1" VPOS="2214" WC="1" WIDTH="100"/>
<HYP CONTENT="-" HPOS="1435" VPOS="2214" WIDTH="26"/>
</TextLine>
</TextBlock>
<TextBlock ID="PAG_00000213_TB000010" HEIGHT="156" HPOS="224" STYLEREFS="TXT_77" VPOS="2394" WIDTH="1236" language="FR">
<TextLine ID="PAG_00000213_TL000025" BASELINE="2431" HEIGHT="48" HPOS="224" VPOS="2394" WIDTH="1235">
<String ID="PAG_00000213_ST000194" CONTENT="Hugo" HEIGHT="45" HPOS="224" STYLEREFS="TXT_7" SUBS_CONTENT="Victor-Hugo" SUBS_TYPE="HypPart2" VPOS="2394" WC="0.983" WIDTH="110"/>
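
A naive consistency check of the kind mentioned above can be sketched in Python: it pairs each HypPart1 with the next HypPart2 in document order and compares their SUBS_CONTENT values. The function name `hyphen_mismatches` and the wrapping `Page` root are assumptions for illustration; the two String elements are trimmed from the snippet above.

```python
import xml.etree.ElementTree as ET

# The two String elements from the snippet above, trimmed to the relevant attributes.
SAMPLE = """<Page>
  <String ID="PAG_00000213_ST000193" CONTENT="Fon-" SUBS_TYPE="HypPart1" SUBS_CONTENT="Fontanes"/>
  <String ID="PAG_00000213_ST000194" CONTENT="Hugo" SUBS_TYPE="HypPart2" SUBS_CONTENT="Victor-Hugo"/>
</Page>"""


def hyphen_mismatches(root):
    """Pair each HypPart1 with the next HypPart2; return the inconsistent pairs."""
    problems = []
    pending = None  # last HypPart1 still waiting for its second part
    for string in root.iter("String"):
        kind = string.get("SUBS_TYPE")
        if kind == "HypPart1":
            if pending is not None:
                problems.append((pending.get("ID"), None))
            pending = string
        elif kind == "HypPart2":
            if pending is None:
                problems.append((None, string.get("ID")))
            elif pending.get("SUBS_CONTENT") != string.get("SUBS_CONTENT"):
                problems.append((pending.get("ID"), string.get("ID")))
            pending = None
    if pending is not None:
        problems.append((pending.get("ID"), None))
    return problems


print(hyphen_mismatches(ET.fromstring(SAMPLE)))
```

On this page the check reports the "Fon-" / "Hugo" pair as inconsistent, even though each String is correct with respect to its own (non-adjacent) partner.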


ALTO - PAGE xml: Object mapping and possible transformation generation

At the face-to-face conference in Vienna, the idea came up to define a conversion between PAGE and ALTO as a best-practice mapping between the objects of the two standards.
If feasible, a transformation could be provided as XSLT.

The idea is to create a mapping from the latest ALTO version 4 to the upcoming PAGE version in June, and from there to work backwards as far as this makes sense.

The goal is a common solution for the mapping, especially for objects where no exact match is possible and workarounds or compromises need to be defined.

Process Result tracking (IMPACT)

Champion: Clemens Neudecker
Submitter: Impact
Submitted: 2013-02
Status: discussion


submitted - initial status when proposal is submitted

discussion - proposal is being discussed within the board

review - xsd code is being reviewed

accepted - proposal is accepted

rejected - proposal is rejected

draft - accepted proposal is in public commenting period

published - proposal is published in a schema version

Backwards compatible ??
To ALTO version ?

Purpose
Many software tools and human interactions are involved in the different steps of the digitisation process. Each of them may affect an ALTO file by making refinements or corrections. From our point of view it would be desirable to keep track of the changes and verifications made by the different agents involved in the digitisation process. This would provide a simple form of document history and would also give important information about the trustworthiness of the whole document. If, for example, everything was verified by a service provider, then we can assume that the quality of the document is very high. Storing the old values as well as the new ones would increase the file size tremendously.

Correction and Validation are possible outcomes of the same process.

Implementation
The ALTO schema already defines a <postProcessingStep> element. The intention of this element is to record any details about those process steps that were carried out after the creation of the full text. The element is optional and not part of the actual page’s definition in ALTO.

In order to store information about the correction and verification process for individual text lines, words etc., the following elements are added to the <postProcessingStep> section:

• <processingStepType> stores the type of process step. It is a free text field, though IMPACT internal constraints require the element’s value to be set to “correction”.
• <processingResult> groups all elements regarding the result of the process. The element is repeatable. Each <processingResult> element represents a specific outcome of the process, which is recorded in the element’s value attribute. This attribute may only contain two values: “corrected” or “verified”.
• <processedElements> is an element that wraps around all elements that were processed with the result stated in the <processingResult> element’s value attribute.
• Each <pe> element contains the ID value of an individual text line or word element. Unprocessed elements are not listed here.
If an element has not been processed, it is not listed within <processedElements>.

Example:

<postProcessingStep ID="0003">      
  <processingDateTime>2012-05-26T09:34:00+02:00</processingDateTime>      
  <processingAgency>ACME Agency</processingAgency>     
  <processingStepDescription>Proofreading</processingStepDescription>     
  <processingStepSettings>Double keying required</processingStepSettings>     
  <processingSoftware>
   <softwareCreator>ACME Software Corp.</softwareCreator>           
   <softwareName>Proofer</softwareName>
   <softwareVersion>12.1</softwareVersion>
   <applicationDescription>Distributed proofreading software</applicationDescription>     
  </processingSoftware>
  <processingResult value="Proof reading performed">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
  <processingResult value="Uncorrected">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
</postProcessingStep>
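
A short Python sketch of how such a record could be consumed: it inverts the structure into a map from element ID to the outcomes recorded for it. The sample is trimmed from the example above but uses the two value strings named in the implementation notes; the function name `results_by_element` is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed sample using the proposed "corrected" / "verified" outcome values.
SAMPLE = """<postProcessingStep ID="0003">
  <processingResult value="corrected">
    <processedElements><pe>P4_TB00003</pe><pe>P4_ST00004</pe></processedElements>
  </processingResult>
  <processingResult value="verified">
    <processedElements><pe>P4_ST00004</pe></processedElements>
  </processingResult>
</postProcessingStep>"""


def results_by_element(step):
    """Map each processed element ID to the list of outcomes recorded for it."""
    outcomes = {}
    for result in step.iter("processingResult"):
        for pe in result.iter("pe"):
            outcomes.setdefault(pe.text.strip(), []).append(result.get("value"))
    return outcomes


print(results_by_element(ET.fromstring(SAMPLE)))
```

Elements that never appear in the map were not processed in this step, matching the rule that unprocessed elements are simply not listed.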

Schema changes draft

Current schema

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation> 
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organization-level producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string"  minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.
        </xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>    
  </xsd:sequence>
</xsd:complexType>
Changed schema

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingStepType" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Type of processing step</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organization-level producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>
    <xsd:element name="processingResult" type="processingResultType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="processingResultType">
  <xsd:annotation>
    <xsd:documentation>List of processed elements.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processedElements" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>ID of processed element</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="pe" type="xsd:IDREF" minOccurs="1" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
  <xsd:attribute name="value" type="xsd:string"/>
</xsd:complexType>

Allow shape-element usage (IMPACT)

Submitter: IMPACT
Submitted: 2013-02

Use Case

ALTO 2.0 uses four attributes (HEIGHT, WIDTH, HPOS, VPOS) to describe the location and size of a text line. These coordinates do not describe the text line as such, but a bounding box around the text line. This box is always a rectangle.
For analysis purposes the shape needs to be described more precisely, and coordinate information must not be limited to text lines. The shape should also be recorded for individual characters, words, text lines, blocks and the print space as such.

Implementation
The <Shape> element should store either a polygon, rectangle, ellipse or circle. There must not be a sibling <Shape> element for the same parent. The element is optional. A <Shape> element can be added to the following parent elements:

  • Glyph
  • String
  • TextLine
  • PageSpaceType
  • All block types

The <Shape> element can only have a single child element. This child element describes the type of shape with the exact coordinates. All coordinates are expressed as float values. The following shape types are supported by the appropriate elements:

  • Polygon (<Polygon>): the POINTS attribute contains a list of coordinate pairs. Each pair consists of an HPOS and a VPOS value separated by a whitespace character. The pairs are separated by whitespace as well. A POINTS attribute must contain at least three pairs.
  • Circle (<Circle>): the HPOS and VPOS attributes specify the centre of the circle. RADIUS specifies the radius of the circle.
  • Ellipse (<Ellipse>): the HPOS and VPOS attributes specify the centre of the ellipse. HLENGTH specifies the horizontal radius of the ellipse. VLENGTH specifies the vertical radius of the ellipse. ROTATION specifies the rotation in degrees counterclockwise.

For backward compatibility, rectangles will continue to be expressed using the existing HEIGHT, WIDTH, HPOS and VPOS attributes on all block types that can carry the new Shape element. However, these attributes become optional if a Shape element exists. Each Shape child (Polygon, etc.) must contain its coordinates (required).

<TextBlock language="de-DE" ID="ID017" STYLEREF="ID011" HEIGHT="1564" WIDTH="1592" HPOS="193" VPOS="364">
  <TextLine ID="D035" STYLEREFS="ID002" BASELINE="1265" CS="false">
    <Shape>
      <Polygon POINTS="752.2 1239.1 752 1672 805 1672 805 1239"/>
    </Shape>
    <String ID="P13_ST00001" HPOS="539" VPOS="562" WIDTH="681" HEIGHT="39" CONTENT="Advertisement" WC="0.35" CC="8688056667845757"/>
  </TextLine>
</TextBlock>
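
Since the rectangular attributes stay available next to a Shape, a consumer that only understands rectangles could derive a bounding box from the polygon. The Python sketch below shows one way to do that for the whitespace-separated POINTS syntax described above; the function names are hypothetical and not part of ALTO.

```python
def parse_points(points):
    """Split a whitespace-separated POINTS value into (HPOS, VPOS) pairs."""
    values = [float(v) for v in points.split()]
    if len(values) % 2 != 0:
        raise ValueError("POINTS must contain an even number of values")
    pairs = list(zip(values[0::2], values[1::2]))
    if len(pairs) < 3:
        raise ValueError("a polygon needs at least three coordinate pairs")
    return pairs


def bounding_box(pairs):
    """Return the smallest axis-aligned rectangle enclosing all pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    hpos, vpos = min(xs), min(ys)
    return {
        "HPOS": hpos,
        "VPOS": vpos,
        "WIDTH": max(xs) - hpos,
        "HEIGHT": max(ys) - vpos,
    }


box = bounding_box(parse_points("752.2 1239.1 752 1672 805 1672 805 1239"))
print(box)
```

The bounding box is only an approximation of the polygon, which is the reason the proposal introduces Shape in the first place.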

Changes

current (ALTO 3.0)

<xsd:complexType name="StringType" mixed="false">
    <xsd:annotation>
        <xsd:documentation>A sequence of chars. Strings are separated by white spaces or hyphenation chars.</xsd:documentation>
    </xsd:annotation>
    <xsd:sequence minOccurs="0">
        <xsd:element name="ALTERNATIVE" maxOccurs="unbounded"/>
... 

proposed

<xsd:complexType name="StringType" mixed="false">
    <xsd:annotation>
        <xsd:documentation>A sequence of chars. Strings are separated by white spaces or hyphenation chars.</xsd:documentation>
    </xsd:annotation>
    <xsd:sequence minOccurs="0">
        <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="ALTERNATIVE" minOccurs="0"  maxOccurs="unbounded"/>
...

current

<xsd:element name="TextLine" maxOccurs="unbounded">
    <xsd:annotation>
        <xsd:documentation>A single line of text.</xsd:documentation>
    </xsd:annotation>
    <xsd:complexType>
    <xsd:sequence>
        <xsd:sequence maxOccurs="unbounded">
            <xsd:element name="String" type="StringType"/>
            <xsd:element name="SP" minOccurs="0"/>
...

proposed

<xsd:element name="TextLine" maxOccurs="unbounded">
    <xsd:annotation>
        <xsd:documentation>A single line of text.</xsd:documentation>
    </xsd:annotation>
...
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>  
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
...

    <xsd:complexType>
        <xsd:sequence>
        <xsd:sequence maxOccurs="unbounded">
        <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="String" type="StringType"/>
        <xsd:element name="SP" minOccurs="0"/>
...

current

...

proposed

<xsd:complexType name="GlyphType" mixed="false">
    <xsd:annotation>
        <xsd:documentation>A sequence of chars. Strings are separated by white spaces or hyphenation chars. Rectangular attributes HEIGHT, WIDTH, HPOS, VPOS are only for use when describing rectangles; omit them when describing non-rectangular shapes.
        </xsd:documentation>
    </xsd:annotation>
...
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>  
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
...
    <xsd:sequence minOccurs="0">
    <xsd:element name="Shape" type="ShapeType"  minOccurs="0" maxOccurs="1"/>
    <xsd:element name="Variant" minOccurs="0" maxOccurs="unbounded"/>
...

current

<xsd:complexType name="PageSpaceType">
    <xsd:annotation>
        <xsd:documentation>A region on a page</xsd:documentation>
    </xsd:annotation>
    <xsd:sequence minOccurs="0" maxOccurs="unbounded"> 
        <xsd:group ref="BlockGroup"/>
    </xsd:sequence>
    <xsd:attribute name="ID" type="xsd:ID" use="optional"/>
...

proposed

<xsd:complexType name="PageSpaceType">
    <xsd:annotation>
        <xsd:documentation>A region on a page</xsd:documentation>
    </xsd:annotation>
    <xsd:sequence minOccurs="0" maxOccurs="unbounded">
        <xsd:element name="Shape" type="ShapeType" minOccurs="0"  maxOccurs="1"/>
        <xsd:group ref="BlockGroup" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="ID" type="xsd:ID" use="optional"/>
...

current

<xsd:complexType name="EllipseType">
    <xsd:annotation>
        <xsd:documentation>An ellipse shape.</xsd:documentation>
    </xsd:annotation>
    <xsd:attribute name="HPOS" type="xsd:float"/>
    <xsd:attribute name="VPOS" type="xsd:float"/>
    <xsd:attribute name="HLENGTH" type="xsd:float"/>
    <xsd:attribute name="VLENGTH" type="xsd:float"/>
</xsd:complexType>
...

proposed

<xsd:complexType name="EllipseType">
    <xsd:annotation>
        <xsd:documentation>An ellipse shape.</xsd:documentation>
    </xsd:annotation>
    <xsd:attribute name="HPOS" type="xsd:float" use="required"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="required"/>
    <xsd:attribute name="HLENGTH" type="xsd:float" use="required" />
    <xsd:attribute name="VLENGTH" type="xsd:float" use="required"/>
    <xsd:attribute name="ROTATION" type="xsd:float" use="optional">
        <xsd:annotation>
           <xsd:documentation>Tells the rotation of the block e.g. text or illustration.  The value is in degrees counterclockwise. </xsd:documentation>
        </xsd:annotation>
    </xsd:attribute>
</xsd:complexType>
...

current

<xsd:complexType name="CircleType">
    <xsd:annotation>
        <xsd:documentation>A circle shape.</xsd:documentation>
    </xsd:annotation>
    <xsd:attribute name="HPOS" type="xsd:float"/>
    <xsd:attribute name="VPOS" type="xsd:float"/>
    <xsd:attribute name="RADIUS" type="xsd:float"/>
</xsd:complexType>
...

proposed

<xsd:complexType name="CircleType">
    <xsd:annotation>
        <xsd:documentation>A circle shape.</xsd:documentation>
    </xsd:annotation>
    <xsd:attribute name="HPOS" type="xsd:float" use="required"/>
    <xsd:attribute name="VPOS" type="xsd:float"  use="required"/> 
    <xsd:attribute name="RADIUS" type="xsd:float" use="required"/>
</xsd:complexType>
...

Change OCRProcessing to Processing

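The current process recording elements are tied to OCR and are also somewhat redundant. Would it make sense to change <OCRProcessing> to <Processing>, and to replace <preProcessingStep>, <ocrProcessingStep> and <postProcessingStep> with a single generic step element that has a child element to record the type of processing performed, as in the IMPACT processing result suggestion?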

Expand schema documentation for PointsType

PointsType in ALTO v4 has very basic documentation:

<xsd:documentation>A list of points</xsd:documentation>

It would seem clearer to explicitly document PointsType as a list of coordinate pairs, particularly for complex shapes and polylines. For example, using the Polygon syntax from issue 22:

<Shape>
  <Polygon POINTS="752.2 1239.1 752 1672 805 1672 805 1239"/>
</Shape>

This is arguably clearer as a list of coordinate pairs by using commas:

<Shape>
  <Polygon POINTS="752.2,1239.1 752,1672 805,1672 805,1239"/>
</Shape>

Or perhaps:

<Shape>
  <Polygon POINTS="(752.2,1239.1) (752,1672) (805,1672) (805,1239)"/>
</Shape>

The documentation might be a variation of what is used for MeasurementUnitType:

<xsd:documentation>
A list of coordinate-pairs that are absolute to the upper-left corner of a page. The upper 
left corner of the page is defined as coordinate (0,0).
</xsd:documentation>

This would seem to reduce the possibility of missing a coordinate and be more friendly to software interpretation without breaking backwards compatibility.
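
To make the "friendly to software interpretation" point concrete, here is a small Python sketch of a reader that accepts all three notations shown above and reports a missing coordinate. It is only an illustration under the assumption that a tolerant reader is wanted; the function name is hypothetical and nothing here is prescribed by ALTO.

```python
import re

# Matches plain decimal numbers such as 752, 752.2 or .5 (no exponents).
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def parse_points_any(points):
    """Read POINTS written as 'x y x y', 'x,y x,y' or '(x,y) (x,y)'."""
    numbers = [float(n) for n in _NUMBER.findall(points)]
    if len(numbers) % 2 != 0:
        # An odd count means one coordinate of a pair is missing.
        raise ValueError("incomplete coordinate pair in POINTS")
    return list(zip(numbers[0::2], numbers[1::2]))


plain = parse_points_any("752.2 1239.1 752 1672 805 1672 805 1239")
commas = parse_points_any("752.2,1239.1 752,1672 805,1672 805,1239")
parens = parse_points_any("(752.2,1239.1) (752,1672) (805,1672) (805,1239)")
print(plain == commas == parens)
```

With the purely whitespace-separated form, a dropped value can only be detected through the odd count; the comma and parenthesis forms additionally make the pairing visible to a human reader.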

Confidence value calculation (CC - WC - PC) - annotation extension

Submitter: CCS
Submitted: 2013-02
Status: Discussion
Backwards compatible: Yes (only annotation)
To ALTO Version: ?

For page, word and character confidence, the schema does not define how the values are calculated.
To make confidence values comparable, the idea is to share the calculation method and agree on a common rule.

Below are the calculation methods used so far by CCS in docWorks.

Precondition detail:

ABBYY FineReader up to version 7.1: the character confidence range is 28 (good) to 55 (bad).

ABBYY FineReader from version 8.0: the character confidence range is 0 (good) to 100 (bad).

These ranges have to be transformed into the range defined by ALTO (0 to 9; see below), which loses precision.

For this reason CCS calculates WC from the more precise ABBYY values (range 28-55 or 0-100) rather than from the rounded CC digits. A WC recomputed from the CC values stored in the ALTO file can therefore differ from the stored WC by rounding.

CC:

The character confidence is defined in ALTO on a scale of "0" to "9": "0" is best, "9" is worst.

Character confidence is determined from the ABBYY character confidence.
The results from the FineReader engines are normalized to the ALTO scale of 0 to 9 per character.
For example, the word FAX, recognized with full confidence by the OCR engine, will have a CC of 000: one digit for every character.

WC:

Word Confidence is determined based on character level confidence.
The better the character confidence the better the word confidence.
In addition the word confidence is influenced by the dictionary verification.

If a word is found in the dictionary, its word confidence value increases.
The longer the word, the higher the confidence value.
(Explanation: if a long word, e.g. one with 15 characters, is found in the dictionary, it is very likely correct, because a word with a wrongly recognized character is unlikely to match a dictionary entry by chance. Short words such as 'fun' and 'fan' are both in the dictionary, so a dictionary check gives no added guarantee that the right word was recognized.)
For the same reason, words with 2 or fewer characters are not checked against the dictionary.

The word confidence is normalized to an interval of "0.00" to "1.00": "1.00" is best, "0.00" is worst.
Calculation:
double( (sum CC)/numChar )/1000.0 - normalization to (0,1)
Example:

<String HPOS="5485" VPOS="4654" WIDTH="468" HEIGHT="109" CONTENT="quorum" WC="1.00" CC="211110"/>
<SP HPOS="5953" VPOS="4762" WIDTH="104"/>
<String HPOS="6057" VPOS="4606" WIDTH="524" HEIGHT="132" CONTENT="conliflmg" WC="0.89" CC="110121122"/>
<SP HPOS="6581" VPOS="4762" WIDTH="61"/>
<String HPOS="6643" VPOS="4592" WIDTH="128" HEIGHT="118" CONTENT="of" WC="0.93" CC="02"/>
<SP HPOS="6770" VPOS="4762" WIDTH="52"/>
<String HPOS="6822" VPOS="4635" WIDTH="61" HEIGHT="66" CONTENT="a" WC="0.85" CC="2"/>
<SP HPOS="6883" VPOS="4762" WIDTH="71"/>
<String HPOS="6954" VPOS="4597" WIDTH="468" HEIGHT="137" CONTENT="majority" WC="1.00" CC="12101111"/>
<SP HPOS="7422" VPOS="4762" WIDTH="52"/>
<String HPOS="7474" VPOS="4578" WIDTH="123" HEIGHT="113" CONTENT="of" WC="0.96" CC="01"/>

When a word is in the dictionary, the confidence is 1.0. Otherwise it is computed, mainly as the average of all "reversed" CC values: for "212" this is ((10-2) + (10-1) + (10-2)) / 3 = 25/3 = 8.33, which gives a WC of 0.83.
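
The "reversed" average can be written down in a few lines of Python. This is a sketch of the rule as stated in the sentence above, not the docWorks implementation; the function name and the in_dictionary flag are assumptions for illustration.

```python
def wc_from_cc(cc, in_dictionary=False):
    """Approximate WC from a CC digit string using the 'reversed' average."""
    if in_dictionary:
        # Rule from the text: dictionary words get full confidence.
        return 1.0
    if not cc or not cc.isdigit():
        raise ValueError("CC must be a non-empty string of digits 0-9")
    reversed_scores = [10 - int(digit) for digit in cc]
    average = sum(reversed_scores) / len(reversed_scores)  # scale 1..10
    return average / 10  # scale 0.1..1.0


print(round(wc_from_cc("212"), 2))
```

Values computed this way will not always match the WC values in the example lines, because, as noted earlier, CCS derives WC from the more precise engine values rather than from the rounded CC digits.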

For short words (fewer than 3 characters) the risk of incorrect characters is higher, so WC is calculated differently (still pending).

Details:

FR9 (also FR8.1 and FR10): the ABBYY character confidence range is 0-100.
The character confidence is normalized to (0,9). The word confidence is the sum of the character confidences divided by the number of characters, i.e. their average.

Before the WC attribute is written to ALTO, the word is checked against the ABBYY dictionary. If the word is found in the dictionary, the confidence increases:
1000 - ((1000 - charConfLevel) / (chars.GetSize()*3));

Otherwise, if the word is not found in the ABBYY dictionary, the initially determined word confidence level is used and normalized to (0,1).

Note:
charConfLevel: word confidence, the average confidence on a character basis.
chars.GetSize: number of characters in the word.
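
The dictionary boost formula can be tried out with the following Python sketch. It assumes that charConfLevel lives on a 0 to 1000 scale where 1000 is best, which is inferred from the constant 1000 and the division by 1000.0 in the WC calculation; it also uses floating-point division, whereas the original code may use integer arithmetic.

```python
def boosted_word_confidence(char_conf_level, num_chars):
    """Apply the dictionary boost and normalize the result to 0..1.

    char_conf_level: average character-based confidence, assumed 0..1000.
    num_chars: number of characters in the word (chars.GetSize()).
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    boosted = 1000 - (1000 - char_conf_level) / (num_chars * 3)
    return boosted / 1000.0


short_word = boosted_word_confidence(700, 5)
long_word = boosted_word_confidence(700, 15)
print(short_word, long_word)
```

With the same starting confidence, the longer word ends up with the higher value, which matches the statement that longer dictionary words are trusted more.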

PC:

The page confidence is calculated as the average dictionary confidence of all alphanumeric characters.
The page confidence is normalized to an interval of "0.00" to "1.00": "1.00" is best, "0.00" is worst.

Details:
The confidence is calculated by adding up the confidences of all XMLTexts (sum of character confidences):

set confidenceSum [expr $confidenceSum + $noOfAlphaNumChars * $confidence ]

In the end the total page confidence is calculated with this formula:

return [ expr $confidenceSum/$pgNoOfAlphaNumChars ]

Note:

confidence: XMLText dictionary confidence

The total character confidence sum divided by the number of characters on the page (normalized in the end to (0,1)) determines the page confidence.
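
The two Tcl expressions amount to a character-weighted average. A minimal Python sketch of the same idea follows; the input format (a list of pairs) and the function name are assumptions, and the confidences are taken to be already normalized to 0..1.

```python
def page_confidence(texts):
    """Character-weighted average of per-text dictionary confidences.

    texts: iterable of (number_of_alphanumeric_chars, confidence) pairs,
    one pair per XMLText.
    """
    confidence_sum = 0.0
    total_chars = 0
    for num_chars, confidence in texts:
        confidence_sum += num_chars * confidence
        total_chars += num_chars
    if total_chars == 0:
        # No alphanumeric characters at all; the caller decides what to report.
        return None
    return confidence_sum / total_chars


pc = page_confidence([(10, 0.9), (30, 0.5)])
print(pc)
```

Longer texts dominate the result, so a few short low-confidence fragments have little effect on the page value.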

If there are zones but no OCR, the returned confidence value is 999, indicating a bad confidence level.
For blank pages the returned confidence value is 100, indicating full confidence.

Type definition for attribute "CONTENT" of HYP element

In the complexType "StringType" and in the "HYP" element, the attribute "CONTENT" has no type defined.

It is requested to define the type to prevent wrong usage.

<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>One or more hyphenation character. Can appear only at the end of a line.
    </xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"> 
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

Approval

Vocabulary for ProcessingStepDescriptions

One more from the wish list.

The nature of common *ProcessingStep elements (layout analysis, any kind of postcorrection) is only incompletely captured by MIX's change history and often seems to be out of scope of the MIX schema. It would therefore be beneficial to define an (optional?) vocabulary of possible processingStepDescription values to increase interoperability between data sources.

Any comments?

Fragment identifier API for ALTO

The ALTO Fragment Identifier API is a proposal for a web service that, in response to a standard HTTP or HTTPS request:

  • references arbitrary content within an ALTO file through the use of fragment identifiers (referencing),
  • returns the XML contents referenced by such identifiers (dereferencing).

This service aims to facilitate the reuse of ALTO resources in digital libraries (bookmarks, annotations...). It could be used to embody the concept of hyperlinking within ALTO documents and to access the content itself.

The URI could specify any portion of an ALTO file (paragraph, string, illustration...) referenced by various mechanisms (ID, spatial offset, order...), a range of contents (paragraphs 2 to 5), etc.

Note: the ALTO schema is not affected. The idea is to write a specification that digital libraries can implement if they wish.

Use cases

See: http://prezi.com/6fvgzri_z3b3/?utm_campaign=share&utm_medium=copy

a. A digital library user wants to reference a specific marginalia on a specific page of a digital document, given its spatial position:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96006893/f20.alto/id/@89:485
RETURNS a list of block IDs : ("PAG_00000020_TB000010")

-> http://gallica.bnf.fr/ark:/12148/bpt6k96006893/f20.alto/xml/TextBlock[ID=PAG_00000020_TB000010]
RETURNS: the TextBlock XML element
<TextBlock ID="PAG_00000020_TB000010" WIDTH="1386" HEIGHT="287" VPOS="1090" HPOS="1303" STYLEREFS="TXT_18" LANG="fr">
  <TextLine ID="PAG_00000020_TL000016" WIDTH="1383" HEIGHT="63" VPOS="1090" HPOS="1304" STYLEREFS="TXT_18">
    <String ID="PAG_00000020_ST000071" ...

b. An application wants to list all the images on a specific page of a digital document:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/id/Illustration
RETURNS a list of block IDs: ("PAG_00000026_IL000001")

-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/xml/Illustration[ID=PAG_00000026_IL000001]
RETURNS the XML element:
<Illustration ID="PAG_00000026_IL000001" HPOS="744" VPOS="707" HEIGHT="3410" WIDTH="819"/>

From this XML content, the application can then extract the illustration using IIIF:
-> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26/744,707,819,3569/full/0/native.jpg
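
A client could assemble such an IIIF request from the ALTO attributes roughly as follows. This Python sketch assumes that the ALTO coordinates are expressed in pixels of the image being served and that the region is given as x,y,w,h; the function name and its default parameters are hypothetical, and the URL layout simply mirrors the example above.

```python
import xml.etree.ElementTree as ET


def iiif_region_url(base, element, size="full", rotation="0",
                    quality_format="native.jpg"):
    """Build an IIIF-style URL whose region is taken from ALTO attributes."""
    x, y, w, h = (
        int(float(element.get(name)))
        for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    )
    return f"{base}/{x},{y},{w},{h}/{size}/{rotation}/{quality_format}"


illustration = ET.fromstring(
    '<Illustration ID="PAG_00000026_IL000001" '
    'HPOS="744" VPOS="707" HEIGHT="3410" WIDTH="819"/>'
)
url = iiif_region_url(
    "http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26", illustration
)
print(url)
```

The numbers come straight from the attributes; whether a service pads or rescales the region before serving it is not stated in the proposal.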

c. An application wants to extract all the text within the print space of a specific page:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/id/PrintSpace/*[@CONTENT]
RETURNS a list of block IDs: ("PAG_00000026_TB000002","PAG_00000026_TB000003","PAG_00000026_TB000004"...)

From these IDs, the application can then extract the XML elements and filter the text blocks to access the text itself.
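
On the server or client side, the "list IDs, then dereference, then read the text" sequence of the use cases could look like the Python sketch below. The sample document and all function names are invented for illustration, namespaces are left out, and the URL syntax of the proposal is not parsed here.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an ALTO page; real files carry a namespace and more attributes.
SAMPLE = """<PrintSpace>
  <TextBlock ID="TB1">
    <TextLine ID="TL1">
      <String ID="S1" CONTENT="Hello"/>
      <SP/>
      <String ID="S2" CONTENT="world"/>
    </TextLine>
  </TextBlock>
  <Illustration ID="IL1"/>
</PrintSpace>"""


def ids_of(root, element_name):
    """Referencing: list the IDs of all elements with the given name."""
    return [el.get("ID") for el in root.iter(element_name)]


def dereference(root, element_name, element_id):
    """Dereferencing: return the element with the given name and ID, or None."""
    for el in root.iter(element_name):
        if el.get("ID") == element_id:
            return el
    return None


def block_text(block):
    """Join the CONTENT of all String elements, one output line per TextLine."""
    lines = []
    for line in block.iter("TextLine"):
        lines.append(" ".join(s.get("CONTENT", "") for s in line.iter("String")))
    return "\n".join(lines)


root = ET.fromstring(SAMPLE)
block_ids = ids_of(root, "TextBlock")
text = block_text(dereference(root, "TextBlock", block_ids[0]))
print(block_ids, text)
```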

Inspiration

IIIF Image API (http://iiif.io/api/image/2.0) specifies a web service that returns an image. The HTTP request can specify the region, size, rotation, quality characteristics and format of the requested image
-> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k65372641/f1/1165.4351015801358,833.7189616252821,969.8363431151238,964.1647855530472/171,170/0/native.jpg

The EPUB format has a recommended specification on fragment identifiers (http://www.idpf.org/epub/linking/cfi/epub-cfi.html) that helps to express paths to specific locations within the content:
-> book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)

Related work:
http://pro.europeana.eu/blogpost/europeana-aligns-with-the-international-image-interoperability-framework-iiif
http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/Europeana_Cloud/Deliverables/D4.4%20Recommendations%20For%20Enhancing%20EDM%20to%20Support%20Research%20Oriented%20Content.pdf

Actions

  1. Use cases survey
  2. Contact with IIIF ?
  3. Syntax specs

ALTO 4.0: adaptation of "Processing" substructure

As outlined on the last conference call, the CCS development team found an issue with the definition of the new "Processing" node while implementing the new schema version in our product.

In the current structure, the "Processing" node references the complexType "processingStepType":
(image: alto_4-0)

The intention was to reference the complexType "processingType", which carries the classification of the processing operation:
(image: alto_4-0_mod)
This type is defined in the schema but is currently not referenced anywhere.

The proposal is simply to add the missing classification as an element alongside the other information. This keeps the hierarchy as flat as it is now: the classification becomes an additional property of "processingStepType" instead of an additional node level, like this:

(image: alto_4-1_prop)

This remains compatible with 4-0 and allows the classification as intended in the discussion, so that operations can be clustered where applicable.

Finally, Jukka also proposed to allow references to the processing step elements not only at page level but also on the substructures, e.g. to mark strings that were reprocessed with additional operations.

I will now upload the modified schema as 4-1 (draft) as a proposal for review.

Glyphs (IMPACT)

Submitter: Impact
Submitted: 2013-02

use case
Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. Each character has its own coordinate information and must be separately addressable as a distinct object. Correction and verification processes can be carried out for individual characters. Post-OCR analysis of the text as well as adaptive OCR algorithms must be able to record information on character level.

In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly.

implementation
Glyphs are recorded in the <Glyph> element. This element is optional and a child element of <String>. The glyph element may have a <Shape> element (see above). The (recognized) character of the glyph is stored in the CONTENT attribute.

The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. Due to post-processing steps such as correction the values of both attributes may be inconsistent.

Each <Glyph> element may have an optional VALID attribute. This attribute may only have one of the following three values:

  • “s”: the glyph is a suspicious character. The OCR software is not confident that it has recognized the glyph correctly.
  • “r”: the character has been rejected; the OCR is confident that this character is not the glyph.
  • “c”: The OCR software is not confident that it has recognized the glyph correctly.

Each <Glyph> may have one or more <Variant> elements. Each variant represents an option for the glyph that the OCR software could have chosen. The element’s VC attribute records a float value between 0 and 1 that expresses the level of confidence for the variant, where 1 is confident. This attribute is optional. If it is not available, the default value for the variant is “0”. The VC attribute’s semantics are similar to those of the WC attribute for the <String> element.

example

<TextBlock ID="P4_TB00001">
  <TextLine ID="P4_TL00001">
    <Shape>
      <Rectangle HPOS="230" VPOS="216" WIDTH="987" HEIGHT="31" />
    </Shape>
    <String ID="P4_ST00001" CONTENT="12" WC="0.99" CC="02">
      <Shape>
        <Rectangle HPOS="230" VPOS="223" WIDTH="37" HEIGHT="24"/>
      </Shape>
      <Glyph ID="P4_ST00001_G01" CONTENT="1" VALID="s" HPOS="230" VPOS="223" WIDTH="10" HEIGHT="24">
        <Shape>
          <Polygon />
        </Shape>
        <Variant VC="0.2">l</Variant>
        <Variant VC="0.1">i</Variant>
      </Glyph>
      <Glyph ID="P4_ST00001_G02" CONTENT="2" HPOS="240" VPOS="223" WIDTH="10" HEIGHT="24">
        <Shape>
          <Polygon />
        </Shape>
        <Variant VC="0.5">s</Variant>
        <Variant VC="0.1">8</Variant>
      </Glyph>
    </String>
  </TextLine>
</TextBlock> 
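
A post-correction tool could read the variants of a glyph and rank them by VC, treating a missing VC as 0 as described above. The Python sketch below illustrates this; the third Variant without a VC attribute is added here only to show the default, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# First glyph of the example above, plus one invented Variant without VC.
GLYPH = """<Glyph ID="P4_ST00001_G01" CONTENT="1" VALID="s"
       HPOS="230" VPOS="223" WIDTH="10" HEIGHT="24">
  <Variant VC="0.2">l</Variant>
  <Variant VC="0.1">i</Variant>
  <Variant>I</Variant>
</Glyph>"""


def ranked_variants(glyph):
    """Return (text, VC) pairs sorted by descending confidence."""
    variants = [
        (variant.text or "", float(variant.get("VC", "0")))
        for variant in glyph.iter("Variant")
    ]
    return sorted(variants, key=lambda item: item[1], reverse=True)


glyph = ET.fromstring(GLYPH)
ranking = ranked_variants(glyph)
print(glyph.get("CONTENT"), ranking)
```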

Proposed change (initial draft):

<xsd:complexType name="StringType" mixed="false">
  <xsd:annotation>
    <xsd:documentation>A sequence of chars. Strings are separated by     white spaces or hyphenation chars.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="0">
    <xsd:element name="Shape" type="ShapeType" minOccurs="0"/>
    <xsd:element name="Alternative" minOccurs="0" maxOccurs="unbounded">
    ..............
    <xsd:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence> 
  <xsd:complexType name="GlyphType" mixed="false">
  <xsd:annotation> 
    <xsd:documentation>
      Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. 
      Each character has its own coordinate information and must be separately addressable as a distinct object.
      Correction and verification processes can be carried out for individual characters.
      Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on character level.
      In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants.
      The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph.
      The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly.

      The glyph elements are in the order of the word. Each character needs to be recorded to build up the whole word sequence.

      The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute.
      Due to post-processing steps such as correction the values of both attributes may be inconsistent. 

    </xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="0">
    <xsd:element name="Shape" type="ShapeType" minOccurs="0"/>
    <xsd:element name="Variant" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>Any alternative for the glyph.</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute name="VC" use="optional">
              <xsd:annotation>
                <xsd:documentation>
                 Each variant represents an option for the glyph that the OCR software could have chosen.
                 The element’s VC attribute records a float value between 0 and 1 that expresses
                 the level of confidence for the variant, where 1 is confident.
                 This attribute is optional. If it is not available, the default value for the variant is “0”.

                 The VC attribute’s semantic is similar to the WC attribute for the String element.
                </xsd:documentation>
              </xsd:annotation>
              <xsd:simpleType>
                <xsd:restriction base="xsd:float">
                  <xsd:minInclusive value="0"/>
                  <xsd:maxInclusive value="1"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:attribute>
          </xsd:extension>
      </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
  <xsd:attribute name="ID" type="xsd:ID" use="optional"/>
  <xsd:attribute name="CONTENT" use="required">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length fixed="true" value="1"/>
        <xsd:whiteSpace value="preserve"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="VALID">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="s"/>
        <xsd:enumeration value="r"/>
        <xsd:enumeration value="c"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>    

Identification of running title

NZ Micrographics, on behalf of the State Library of NSW, has asked for a means of identifying running titles in ALTO, for example "Notable Men of Wales". The schema already gives some indication of where a running title is located on the page:

<xsd:element name="TopMargin" type="PageSpaceType" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>
      The area between the top line of print and the upper edge of the leaf. It may contain 
      page number or running title.
    </xsd:documentation>
   </xsd:annotation>
</xsd:element>

Perhaps RunningTitle could be a TextBlockType?

Change BASELINE to accommodate a list of points in addition to a single point

Günter Mühlberger and Structify colleagues at University of Innsbruck would like to request using a list of points for the BASELINE instead of one single point. So changing from

<xsd:attribute name="BASELINE" type="xsd:float" use="optional"/>
to
<xsd:attribute name="BASELINE" type="PointsType" use="optional"/>

Moreover, for handwritten text it could be useful to have more than one BASELINE for a single text line, e.g. when text was crossed out and overwritten.

The first marked text in the image shows a line with logically two baselines. The word above the line logically belongs to the same line, which is why we would like to allow several baselines for one line.

The second marked text shows why a baseline realised as a polyline is so important when dealing with handwritten or distorted text.

(image: altobaselinerequest)
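
If BASELINE were changed to PointsType, readers would have to cope with both the old single float and the new list of points. The Python sketch below shows one possible way to accept both; it assumes pairs are whitespace-separated (commas are tolerated), and the function name is hypothetical.

```python
def parse_baseline(value):
    """Return a float for a legacy BASELINE, or a list of (x, y) points."""
    numbers = [float(part) for part in value.replace(",", " ").split()]
    if not numbers:
        raise ValueError("BASELINE is empty")
    if len(numbers) == 1:
        # Legacy form: a single vertical position for the whole line.
        return numbers[0]
    if len(numbers) % 2 != 0:
        raise ValueError("incomplete coordinate pair in BASELINE")
    return list(zip(numbers[0::2], numbers[1::2]))


legacy = parse_baseline("1265")
polyline = parse_baseline("100 200 300 210")
print(legacy, polyline)
```

Several baselines per line, as requested for overwritten text, would need a further convention that this sketch does not cover.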

Length of main glyph and variants

Separated from #26 (comment)
Glyph variants: The main glyphs are restricted to length 1 but variants to length 3. This could be a bit inconvenient when dealing with OCR results. Say FineReader returns 5 options, some with length 1 and some longer. What happens if the first one is not of length 1, does the ALTO exporter tool then check if there is one with length 1 among the other options and change the order? And why three? For Latin that would probably cover most cases, but for other scripts there might be longer ones.

ALTO & IIIF integration

textBlock "writing-mode" proposal

It is convenient for CJK users to distinguish direction of text. For example,
writing-mode: horizontal-tb; // YOKO-GAKI for Japanese
writing-mode: vertical-rl; // TATE-GAKI for Japanese

Multiple Shape elements in one TextLine

Hello,

Considering this change between ALTO v3.0 and v3.1:

    <xsd:element name="TextLine" maxOccurs="unbounded">
        <xsd:annotation>
            <xsd:documentation>A single line of text.</xsd:documentation>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:sequence>
                <xsd:sequence maxOccurs="unbounded">
+                   <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
                    <xsd:element name="String" type="StringType"/>
                    <xsd:element name="SP" type="SPType" minOccurs="0"/>
                </xsd:sequence>
                (...)
            </xsd:sequence>
            (...)
        </xsd:complexType>
    </xsd:element>

I guess this is related to the following entry in the v3.1 changelog:

.2. Added support for using different shapes for the elements String, TextLine, all PageSpaceType elements and on all BlockType elements.

I see a problem here: multiple Shape elements can be direct children of a TextLine. According to the schema, the following constructions are allowed:

Ex. 1: No Shape element in the TextLine ✅

<TextLine>
    <String />
    <SP />
    <String />
    <SP />
    <String />
</TextLine>

Ex. 2: One Shape element at the beginning of the TextLine ✅

<TextLine>
    <Shape />
    <String />
    <SP />
    <String />
    <SP />
    <String />
</TextLine>

Ex. 3: One Shape element before each String element of the TextLine ❗

<TextLine>
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
</TextLine>

In the 3rd situation, which Shape element should be selected as the correct shape of the line?

I suggest that TextLine can have at most one Shape child element, at the beginning of the sequence, like this:

    <xsd:element name="TextLine" maxOccurs="unbounded">
        <xsd:annotation>
            <xsd:documentation>A single line of text.</xsd:documentation>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:sequence>
+               <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
                <xsd:sequence maxOccurs="unbounded">
                    <xsd:element name="String" type="StringType"/>
                    <xsd:element name="SP" type="SPType" minOccurs="0"/>
                </xsd:sequence>
                (...)
            </xsd:sequence>
            (...)
        </xsd:complexType>
    </xsd:element>
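
To see whether existing files already contain the ambiguous third construction, a check along these lines might help. This is only a sketch: the function names are made up for illustration, the sample is a stripped-down fragment rather than a full ALTO file, and namespaces are handled by simply comparing local element names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def ambiguous_textline_ids(alto_xml):
    """Return the IDs of TextLine elements with more than one direct Shape child."""
    root = ET.fromstring(alto_xml)
    result = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct children count; a Shape nested inside a String is a different case.
        shapes = [child for child in element if local_name(child.tag) == "Shape"]
        if len(shapes) > 1:
            result.append(element.get("ID"))
    return result


# Hypothetical fragment mirroring Ex. 1 to Ex. 3 above.
SAMPLE = """<TextBlock>
  <TextLine ID="L1"><String/><SP/><String/></TextLine>
  <TextLine ID="L2"><Shape/><String/><SP/><String/></TextLine>
  <TextLine ID="L3"><Shape/><String/><SP/><Shape/><String/></TextLine>
</TextBlock>"""

print(ambiguous_textline_ids(SAMPLE))
```

With the proposed schema change, the third TextLine would no longer validate, so a check like this would only be needed for files written against the current schema.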

MeasurementUnit - annotation change and define as mandatory

In the NPO project, the definition of the measurement values in ALTO was discussed.

The concern was the annotation that defines a "default" unit.

Because of that I would also like to make the element required, but this would cause a backwards-compatibility problem.

Below is a copy of the discussion from the NPO platform, with the arguments I (Jo) noted down for follow-up.

ALTO - measurement unit

Hi all,

Following our call today about the units for values in ALTO, these were the arguments for recommending "pixel" as best practice:

  1. "pixel" is the smallest unit in digitized images
  2. it matches the practice of major newspaper digitization projects (NDNP, KBNL)
  3. there is a weak point: values must be recalculated when the referenced image is adapted, but this is easily covered by the
    WIDTH / HEIGHT information of the PAGE element.
    From these the scaling factor can be calculated and zones easily matched to the related image,
    even if the resolution was changed
  4. a proposal will be put to the ALTO board to record the initial resolution of the referenced image
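
The recalculation mentioned in point 3 could look roughly like this. It is a sketch under assumptions: the function names and the example numbers are invented for illustration, and it presumes the ALTO values are in pixels and that the PAGE element carries the WIDTH and HEIGHT of the image the layout was originally described on.

```python
def scale_factors(page_width, page_height, image_width, image_height):
    """Scaling factors from the PAGE dimensions to the image actually at hand."""
    return image_width / page_width, image_height / page_height


def rescale_zone(hpos, vpos, width, height, sx, sy):
    """Map one zone (HPOS, VPOS, WIDTH, HEIGHT) onto the rescaled image."""
    return hpos * sx, vpos * sy, width * sx, height * sy


# Hypothetical case: the layout was described on a 2400 x 3600 pixel scan,
# but the image delivered to the viewer is a 1200 x 1800 pixel derivative.
sx, sy = scale_factors(2400, 3600, 1200, 1800)
zone = rescale_zone(200, 400, 1000, 60, sx, sy)
print(sx, sy, zone)
```

If the derivative keeps the aspect ratio, both factors are equal; keeping them separate also covers images that were resampled unevenly.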

Based on this I also propose that the MeasurementUnit annotation "The default is 1/10 of mm." be removed or replaced.

I am not sure whether the annotation should be extended with any other statement.

Here is my proposal for the annotation after having thought about it for another 3 minutes:

"All measurement values inside the alto file except fontsize are related to this unit.
The values for pixel will be related to the resolution of the image based on which the layout is described. In case the original image is not known the scaling factor can be calculated based on total width and height of the image and the according information of the PAGE element."

<xsd:element name="MeasurementUnit" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>All measurement values inside the alto file except fontsize
         are related to this unit. The default is 1/10 of mm
    </xsd:documentation>
  </xsd:annotation>
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="pixel"/>
      <xsd:enumeration value="mm10"/>
      <xsd:enumeration value="inch1200"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>

Proposed replacement:

<xsd:element name="MeasurementUnit" minOccurs="1">
  <xsd:annotation>
    <xsd:documentation>All measurement values inside the alto file except fontsize
         are related to this unit. 
         The values for pixel will be related to the resolution of the image based
         on which the layout is described. In case the original image is not known
         the scaling factor can be calculated based on total width and height of
         the image and the according information of the PAGE element.
         pixel: 1 pixel
         mm10: 1/10 of millimeter
         inch1200: 1/1200 of inch
    </xsd:documentation>
  </xsd:annotation>
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="pixel"/>
      <xsd:enumeration value="mm10"/>
      <xsd:enumeration value="inch1200"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
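
To illustrate why a reader of the file needs the unit stated explicitly, here is a small conversion sketch for the three enumeration values. The function name and the dpi parameter are assumptions for illustration; the resolution is not part of MeasurementUnit and would have to come from elsewhere.

```python
# Number of units per inch for the two physical units (1 inch = 25.4 mm = 254 tenths of a mm).
UNITS_PER_INCH = {"mm10": 254.0, "inch1200": 1200.0}


def to_pixels(value, unit, dpi=None):
    """Convert a value in the given MeasurementUnit to pixels.

    `dpi` is the resolution of the image; it is required for mm10 and inch1200
    and ignored for pixel.
    """
    if unit == "pixel":
        return float(value)
    if unit not in UNITS_PER_INCH:
        raise ValueError(f"unknown MeasurementUnit: {unit!r}")
    if dpi is None:
        raise ValueError("a resolution is needed to convert physical units to pixels")
    return value / UNITS_PER_INCH[unit] * dpi


print(to_pixels(254, "mm10", dpi=300))       # one inch at 300 dpi
print(to_pixels(1200, "inch1200", dpi=300))  # one inch at 300 dpi
print(to_pixels(42, "pixel"))
```

A consumer that silently assumed the old "default is 1/10 of mm" for a pixel-based file would be off by the ratio between 254 and the image resolution, which is the kind of mistake a mandatory element avoids.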

Add Processing to replace OCRProcessing

The current process-recording elements are tied to OCR and are also somewhat redundant. I think it would make sense to change OCRProcessing to Processing, and to replace preProcessingStep, ocrProcessingStep and postProcessingStep with a generic processingStep that has a processingStepType element recording the type of processing performed.

Currently:

<OCRProcessing ID="OCRPROCESSING_1">
  <preProcessingStep>
    <processingDateTime>2009-10-19</processingDateTime>
    <processingAgency>CCS Content Conversion Specialists GmbH, 
    </processingAgency>
    <processingStepDescription>align</processingStepDescription>
    <processingStepSettings>CCS OCR Processing Filter</processingStepSettings>
    <processingSoftware>
      <softwareCreator>CCS Content Conversion Specialists GmbH,Germany</softwareCreator>
      <softwareName>CCS docWORKS</softwareName>
      <softwareVersion>6.3-0.91</softwareVersion>
      <applicationDescription/>
    </processingSoftware>
  </preProcessingStep>
  <ocrProcessingStep>
    <processingSoftware>
      <softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
      <softwareName>FineReader</softwareName>
      <softwareVersion>8.1</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

Suggestion

<Processing>
  <ProcessingStep ID="01">
    <processingDateTime>2009-10-19T10:10:10+05:00</processingDateTime>
    <processingStepType>image processing</processingStepType>
    <processingAgency>ACME Processing</processingAgency>
    <processingStepDescription>align</processingStepDescription>
    <processingStepSettings>ACME OCR Processing Filter</processingStepSettings>
    <processingSoftware>
      <softwareCreator>CCS Content Conversion Specialists GmbH, Germany</softwareCreator>
      <softwareName>CCS docWORKS</softwareName>
      <softwareVersion>6.3-0.91</softwareVersion>
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
  <ProcessingStep ID="02">
    <processingDateTime>2009-10-19T10:21:14+05:00</processingDateTime>
    <processingStepType>OCR</processingStepType>
    <processingAgency>CCS Content Conversion Specialists GmbH, www.content-conversion.com</processingAgency>
    <processingStepDescription></processingStepDescription>
    <processingStepSettings></processingStepSettings>
    <processingSoftware>
      <softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
      <softwareName>FineReader</softwareName>
      <softwareVersion>8.1</softwareVersion> 
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
  <ProcessingStep ID="03">
    <processingDateTime>2009-10-19T15:28:30+05:00</processingDateTime>
    <processingStepType>Proofreading</processingStepType>
    <processingAgency>ACME Corp.</processingAgency>
    <processingStepDescription></processingStepDescription>
    <processingStepSettings></processingStepSettings>
    <processingSoftware>
      <softwareCreator>ACME</softwareCreator>
      <softwareName>Proofreader</softwareName>
      <softwareVersion>9.9</softwareVersion>
      <softwareDescription/>
    </processingSoftware>
  </ProcessingStep>
</Processing>

Schema changes:

<xsd:element name="OCRProcessing" minOccurs="0" maxOccurs="unbounded">
+  <xsd:annotation>
+    <xsd:documentation>DEPRECATED: Processing element should be used instead.</xsd:documentation>
+  </xsd:annotation>
 <xsd:complexType>
   <xsd:complexContent>
     <xsd:extension base="ocrProcessingType">
       <xsd:attribute name="ID" type="xsd:ID" use="required"/>
     </xsd:extension>
   </xsd:complexContent>
 </xsd:complexType>
</xsd:element>


+<xsd:element name="Processing" minOccurs="0" maxOccurs="unbounded">
+  <xsd:complexType>
+    <xsd:complexContent>
+      <xsd:extension base="ProcessingStepType">
+        <xsd:attribute name="ID" type="xsd:ID" use="required"/>
+      </xsd:extension>
+    </xsd:complexContent>
+  </xsd:complexType>
+</xsd:element>


<xsd:complexType name="ProcessingStepType">
<xsd:annotation> 
  <xsd:documentation>A processing step.</xsd:documentation>
</xsd:annotation>
 <xsd:sequence>

+  <xsd:element name="processingStepType" type="xsd:string" minOccurs="0"> 
+   <xsd:annotation>
+    <xsd:documentation>Type of processing step</xsd:documentation>
+   </xsd:annotation>
+  </xsd:element>

  <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0"> 
   <xsd:annotation>
    <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
   <xsd:annotation>
    <xsd:documentation>Identifies the organizationlevel producer(s) of the
      processed image.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingStepDescription" type="xsd:string" minOccurs="0" maxOccurs="unbounded">
   <xsd:annotation>
    <xsd:documentation>An ordinal listing of the image processing steps performed.
        For example, "image despeckling."</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
   <xsd:annotation>
    <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.</xsd:documentation>
   </xsd:annotation>
  </xsd:element>
  <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType> 
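
For existing files, the old structure could be mapped onto the suggested one mechanically. The following is a rough sketch, not part of the proposal: the function name, the type labels in STEP_TYPES and the generated ID values are assumptions, and namespaces are left out to keep it short. The processingStepType element is placed first, following the sequence in the schema change above.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical labels; processingStepType is a free xsd:string in the proposal.
STEP_TYPES = {
    "preProcessingStep": "pre-processing",
    "ocrProcessingStep": "OCR",
    "postProcessingStep": "post-processing",
}


def migrate_ocr_processing(ocr_processing):
    """Build a Processing element from an OCRProcessing element."""
    processing = ET.Element("Processing")
    counter = 0
    for old_step in ocr_processing:
        step_type = STEP_TYPES.get(old_step.tag)
        if step_type is None:
            continue  # ignore anything that is not one of the three step elements
        counter += 1
        # xsd:ID values may not start with a digit, hence the "STEP_" prefix.
        new_step = ET.SubElement(processing, "ProcessingStep", ID=f"STEP_{counter}")
        ET.SubElement(new_step, "processingStepType").text = step_type
        # Copy the existing children unchanged after the new type element.
        new_step.extend([copy.deepcopy(child) for child in old_step])
    return processing


OLD = """<OCRProcessing ID="OCRPROCESSING_1">
  <preProcessingStep>
    <processingStepDescription>align</processingStepDescription>
  </preProcessingStep>
  <ocrProcessingStep>
    <processingSoftware><softwareName>FineReader</softwareName></processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>"""

new = migrate_ocr_processing(ET.fromstring(OLD))
print(ET.tostring(new, encoding="unicode"))
```

The old element names carry the only type information available, so anything finer than pre-processing, OCR and post-processing would have to be added by hand.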

Same namespace for v2.0 and v2.1 despite broken backwards compatibility

When implementing ALTO 2.1 in our tools we stumbled upon the fact that a number of changes were introduced which break backwards compatibility of the format(*) while the namespace was kept the same. Or in other words, ALTO 2.1 files cannot (always) be used where 2.0 files were used (which would be required for backwards compatibility) but there is no way of telling the two versions apart. The change of coordinates from integer to float, for instance, will break ALTO 2.0 systems.

Practical issues related to this are (only to mention a few):

  • the inability to identify the version and thus the supported features when reading ALTO files
  • the question of how a tool can save ALTO files in the same format they were opened in (so as not to break backwards compatibility and render the files unreadable for existing software)
  • XSL transformations involving both versions cannot be performed since the same namespace cannot be bound to two (or more) different schemas

There does not seem to be a separate version number element or attribute either. So at the moment it is a real issue how files of different versions can be distinguished, and how to avoid breaking existing workflows and systems which are not (and perhaps never will be) aware of the latest changes once instances in the new format start circulating alongside old ones.

I would be very much interested in what others think of this, potential workarounds and whether that should be considered a bug to be addressed in the next update.

(*) It is important to note that this is about the format and not future (ALTO v2.1 enabled) systems which could be considered automatically backwards compatible due to ALTO 2.0 being more or less a subset of 2.1.
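
As a stopgap, a reader could try to guess the version heuristically. The sketch below is only that: the function name is invented, the file-name pattern searched for in xsi:schemaLocation is an assumption (the attribute is optional and only a hint), and a file with integer-only coordinates is valid in both versions, so the result is often just "unknown".

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
COORD_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def guess_alto_version(alto_xml):
    """Best-effort guess of the ALTO version of a document given as a string."""
    root = ET.fromstring(alto_xml)

    # 1. Look for a version in the schema file name, e.g. "alto-v2.0.xsd" or "alto-4-4.xsd".
    location = root.get(f"{{{XSI}}}schemaLocation", "")
    match = re.search(r"alto[-v]*(\d+)[-.](\d+)\.xsd", location)
    if match:
        return f"{match.group(1)}.{match.group(2)}"

    # 2. A non-integer coordinate cannot come from a 2.0 file.
    for element in root.iter():
        for name in COORD_ATTRS:
            value = element.get(name)
            if value is not None and not value.lstrip("+-").isdigit():
                return "2.1 or later"

    return "unknown"


WITH_HINT = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# '
    'http://www.loc.gov/standards/alto/alto-v2.0.xsd"/>'
)
WITH_FLOAT = '<alto><TextLine HPOS="10.5" VPOS="3" WIDTH="100" HEIGHT="12"/></alto>'
INTEGERS_ONLY = '<alto><TextLine HPOS="10" VPOS="3" WIDTH="100" HEIGHT="12"/></alto>'

for sample in (WITH_HINT, WITH_FLOAT, INTEGERS_ONLY):
    print(guess_alto_version(sample))
```

That the third sample cannot be classified at all is exactly the problem described above; a distinct namespace or a version attribute would make the guesswork unnecessary.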

Recommendation for link to ALTO in iiif manifest

IIIF defines a Presentation API that allows OCR results in ALTO, where available, to be represented as annotations linked from a manifest.

Example:

seeAlso: {
  @id: "http://wellcomelibrary.org/service/alto/b19956435/0?image=0",
  format: "application/alto+xml",
  profile: "http://www.loc.gov/standards/alto/",
  label: "ALTO"
}

It would be good to have a recommendation from the ALTO board on the values for two fields, format and label. The format should resemble a MIME type, e.g. application/xml or text/xml, while the latter can be simple text such as "ALTO XML", "ALTO OCR" or similar.

Consistent data type definition for positions and dimensions (int vs float)

The ALTO schema allows different measurement units.
For pixels no floating-point values are required, as the pixel is the smallest unit you can describe.

But since inch1200 (1/1200 of an inch) and mm10 (1/10 of a mm) are also possible, integer values are not sufficient to describe the exact position of a point.

To ensure that each point can be described as precisely as the available information allows (e.g. for born-digital material), and to simplify the standard and make it easier to understand, I recommend consistently using "xsd:float" for all HPOS, VPOS, HEIGHT and WIDTH attributes.

Changing xsd:int to xsd:float is backwards compatible for existing files: values without ".0" remain valid!
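
A quick way to see which direction the compatibility holds in, sketched with two hypothetical reader functions (Python's float() and int() stand in for a float-based and an int-based parser; they are not the XSD datatypes themselves):

```python
def read_as_float(lexical):
    """Reader written against the float-based schema."""
    return float(lexical)


def read_as_int(lexical):
    """Reader written against the old int-based schema."""
    return int(lexical)


# Old integer values are still readable by a float-based reader.
print(read_as_float("120"))
print(read_as_float("120.5"))

# The reverse does not hold: an int-based reader rejects a fractional value.
try:
    read_as_int("120.5")
except ValueError as error:
    print("int-based reader failed:", error)
```

So files written with integers stay valid under xsd:float, while tools that still parse the attributes as integers need updating once fractional values start to appear.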
