Submitter: CCS ([email protected])
Submitted: 2013-02
Status: Discussion
Backwards compatible: **Yes (Only Annotation)**
To ALTO Version: ?
The schema does not define how the page, word and character confidence values are to be calculated.
To make confidence values comparable, the idea is to share the calculation methods in use and to agree on a common rule.
Below are the calculation methods used so far by CCS in docWorks.
Precondition detail:
ABBYY FineReader up to version 7.1: the character confidence range is 28 (good) to 55 (bad)
ABBYY FineReader from version 8.0 onwards: the character confidence range is 0 (good) to 100 (bad)
These ranges have to be transformed into the range defined by ALTO (0 to 9; see below), which loses precision.
For WC, CCS therefore continues to calculate with the more precise ABBYY values (range 28 - 55 / 0 - 100). As a result, the WC written to ALTO can differ slightly from a WC recomputed from the rounded CC values in the same file.
CC:
The character confidence is defined in ALTO on a scale of "0" to "9", where "0" is best and "9" is worst.
Character confidence is derived from the ABBYY character confidence.
The results from the FineReader engines are normalized to the ALTO scale of 0 to 9 per character.
E.g. the word FAX, recognized with full confidence by the OCR engine, will have a CC of 000: one digit for every character.
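The normalization to the ALTO scale could look roughly like the sketch below. The proposal does not state the exact mapping, so the linear rescaling, the clamping and the round-half-up step are assumptions, and `abbyy_to_alto_cc` is a hypothetical name, not a docWorks function.

```python
def abbyy_to_alto_cc(value: float, best: float, worst: float) -> int:
    """Map an engine confidence onto the ALTO CC scale 0 (best) .. 9 (worst).

    ASSUMPTION: a linear mapping with round-half-up; the actual docWorks
    mapping is not specified in the proposal.
    """
    if best == worst:
        raise ValueError("best and worst must differ")
    # Clamp the value into the engine range so the result stays within 0..9.
    low, high = min(best, worst), max(best, worst)
    clamped = min(max(value, low), high)
    # 0.0 at the "good" end of the engine range, 1.0 at the "bad" end.
    fraction = (clamped - best) / (worst - best)
    return int(fraction * 9 + 0.5)


# FineReader >= 8.0 uses 0 (good) .. 100 (bad); <= 7.1 uses 28 (good) .. 55 (bad).
cc_fax = "".join(str(abbyy_to_alto_cc(v, 0, 100)) for v in (0, 0, 0))
```

With three perfectly recognized characters, `cc_fax` is the string `000` from the FAX example. Whatever the real mapping is, ten output digits cannot preserve a 0 to 100 input range, which is the loss of precision mentioned in the precondition.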
WC:
Word Confidence is determined based on character level confidence.
The better the character confidence the better the word confidence.
In addition the word confidence is influenced by the dictionary verification.
If a word is found in the dictionary, it increases the word confidence value.
The longer the word, the higher the confidence value.
(Explanation: If a long word (e.g. with 15 characters) is found in the dictionary, it is almost certainly correct, because a misrecognized character is unlikely to produce another dictionary word by chance. Short words like 'fun' / 'fan' will both be found in the dictionary, so the dictionary check gives no additional assurance that the right word was detected.)
For the same reason, words with 2 or fewer characters are not checked against the dictionary.
The word confidence is normalized to an interval of "0.00" to "1.00", where "1.00" is best and "0.00" is worst.
Calculation:
double( (sum CC)/numChar )/1000.0 - normalization to (0,1)
Example:
<String HPOS="5485" VPOS="4654" WIDTH="468" HEIGHT="109" CONTENT="quorum" WC="1.00" CC="211110"/>
<SP HPOS="5953" VPOS="4762" WIDTH="104"/>
<String HPOS="6057" VPOS="4606" WIDTH="524" HEIGHT="132" CONTENT="conliflmg" WC="0.89" CC="110121122"/>
<SP HPOS="6581" VPOS="4762" WIDTH="61"/>
<String HPOS="6643" VPOS="4592" WIDTH="128" HEIGHT="118" CONTENT="of" WC="0.93" CC="02"/>
<SP HPOS="6770" VPOS="4762" WIDTH="52"/>
<String HPOS="6822" VPOS="4635" WIDTH="61" HEIGHT="66" CONTENT="a" WC="0.85" CC="2"/>
<SP HPOS="6883" VPOS="4762" WIDTH="71"/>
<String HPOS="6954" VPOS="4597" WIDTH="468" HEIGHT="137" CONTENT="majority" WC="1.00" CC="12101111"/>
<SP HPOS="7422" VPOS="4762" WIDTH="52"/>
<String HPOS="7474" VPOS="4578" WIDTH="123" HEIGHT="113" CONTENT="of" WC="0.96" CC="01"/>
When a word is in the dictionary, the confidence is 1.0; otherwise it is computed, mainly as the average of all “reversed” CC values. For “212” this means ((10-2) + (10-1) + (10-2)) / 3 = 25/3 = 8.33, i.e. a WC of 0.83.
For short words (fewer than 3 characters) the risk of incorrect characters is higher, so the WC is calculated differently (still pending).
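The "reversed" CC average for words that are not in the dictionary can be sketched as follows. `wc_from_cc` is a hypothetical helper name, rounding to two decimals is an assumption, and the pending special case for short words is not covered.

```python
def wc_from_cc(cc: str) -> float:
    """Approximate WC from an ALTO CC string such as "212".

    Each digit d (0 = best, 9 = worst) is "reversed" to 10 - d, the reversed
    values are averaged, and the result is scaled to the 0..1 interval.
    ASSUMPTION: rounding to two decimals, as in the WC values of the example.
    """
    if not cc or any(ch not in "0123456789" for ch in cc):
        raise ValueError("CC must be a non-empty string of digits 0-9")
    reversed_sum = sum(10 - int(ch) for ch in cc)
    return round(reversed_sum / len(cc) / 10, 2)


wc_212 = wc_from_cc("212")              # 0.83, as in the worked example
wc_conliflmg = wc_from_cc("110121122")  # 0.88; the ALTO example shows 0.89
```

The small gap for "conliflmg" (0.88 here, 0.89 in the example) would be consistent with the rounding differences described in the precondition, since CCS computes WC from the unrounded ABBYY values rather than from the CC digits.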
Details:
FR9 (also FR8.1 and FR10): the ABBYY character confidence range is 0-100.
The character confidence is normalized to (0,9). The word confidence is the sum of the character confidences divided by the number of characters, i.e. their average.
Before the WC attribute is written to ALTO, the word is checked against the ABBYY dictionary. If the word is found in the dictionary, the confidence is increased:
1000 - ((1000 - charConfLevel) / (chars.GetSize()*3));
Otherwise, if the word is not found in the ABBYY dictionary, the initially determined word confidence level is used and normalized to (0,1).
Note:
charConfLevel: word confidence (average confidence on character basis)
chars.GetSize(): number of characters in the word
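A minimal sketch of the dictionary increase, transcribing the formula above into Python. It assumes that `charConfLevel` lies on a 0 to 1000 scale with 1000 as best (inferred from the division by 1000.0 in the calculation) and that the division is floating-point. The function name is hypothetical, and the caller is expected to pass `in_dictionary=False` for words of 2 or fewer characters, which are not checked against the dictionary.

```python
def boosted_word_confidence(char_conf_level: float, num_chars: int,
                            in_dictionary: bool) -> float:
    """Return WC on the 0..1 interval.

    char_conf_level: average character-based confidence of the word.
        ASSUMPTION: 0..1000 scale, 1000 = best.
    num_chars: number of characters in the word (chars.GetSize()).
    in_dictionary: result of the dictionary lookup.
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    if in_dictionary:
        # The remaining distance to 1000 shrinks by a factor of 3 * word
        # length, so longer dictionary words end up closer to full confidence.
        level = 1000 - ((1000 - char_conf_level) / (num_chars * 3))
    else:
        level = char_conf_level
    return level / 1000.0


short_word = boosted_word_confidence(850, 6, True)   # 1000 - 150/18, scaled
long_word = boosted_word_confidence(850, 15, True)   # 1000 - 150/45, scaled
```

For the same starting confidence, `long_word` is larger than `short_word`, which matches the statement that longer dictionary words receive a higher confidence value.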
PC:
The Page Confidence is calculated as the average dictionary confidence of all alphanumeric characters.
?
The page confidence is normalized to an interval of "0.00" to "1.00", where "1.00" is best and "0.00" is worst.
Details:
The confidence is calculated by adding up the confidences of all XMLTexts, each weighted by its number of alphanumeric characters (sum of character confidence):
set confidenceSum [expr $confidenceSum + $noOfAlphaNumChars * $confidence ]
and in the end the total page confidence is calculated with this formula:
return [ expr $confidenceSum/$pgNoOfAlphaNumChars ]
Note:
confidence: XMLText dictionary confidence
The total character confidence sum divided by the number of characters on the page (normalized in the end to (0,1)) determines the Page Confidence.
If there are zones but no OCR, the returned confidence value is 999, indicating a bad confidence level.
For blank pages the returned confidence value is 100, indicating full confidence.
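The page-level calculation can be sketched in Python as a character-weighted average. The function name, the input shape (one `(noOfAlphaNumChars, confidence)` pair per XMLText) and the `has_zones` flag are assumptions made for illustration. The special values 999 and 100 are returned exactly as stated above, without the final normalization to (0,1), because the proposal does not say how they relate to that scale.

```python
from typing import Iterable, Tuple

NO_OCR_CONFIDENCE = 999      # zones present but no OCR text (value from the note)
BLANK_PAGE_CONFIDENCE = 100  # blank page (value from the note)


def page_confidence(texts: Iterable[Tuple[int, float]], has_zones: bool) -> float:
    """Character-weighted average of the per-text dictionary confidences.

    texts: (number of alphanumeric characters, confidence) per XMLText.
    has_zones: whether the page has any zones at all (ASSUMED input).
    """
    confidence_sum = 0.0
    total_chars = 0
    for num_alnum_chars, confidence in texts:
        # Mirrors: confidenceSum + noOfAlphaNumChars * confidence
        confidence_sum += num_alnum_chars * confidence
        total_chars += num_alnum_chars
    if total_chars == 0:
        return NO_OCR_CONFIDENCE if has_zones else BLANK_PAGE_CONFIDENCE
    # Mirrors: confidenceSum / pgNoOfAlphaNumChars
    return confidence_sum / total_chars


pc = page_confidence([(10, 0.9), (30, 0.5)], has_zones=True)  # (9 + 15) / 40
```

Weighting by character count means that a long text block influences the page value more than a short one with the same confidence.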