Comments (7)
After converting a document from docx to FoLiA using Piereling (@proycon: I did not find a command line option for such a conversion),
Right, Piereling first invokes pandoc to convert docx to rst and then it uses rst2folia to convert the rst to FoLiA, so it'd be a two-step process on the command line.
foliavalidator rejects both files, so the error is not introduced by ucto.
That's not what I reproduced here:
$ foliavalidator bla.folia.xml
WARNING: Document (bla.folia.xml) uses an older FoLiA version (2.0.0) but is validated according to the newer specification (2.5.0). You might want to increase the version attribute if this is a document you created and intend to publish.
Validated successfully: bla.folia.xml
$ foliavalidator bla_ucto.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in bla_ucto.folia.xml
ParseError: FoLiA exception in handling of <div> @ line 48 (in parent <text> @ parent line 47) : [InconsistentText] Text for <Paragraph at 139634925662464 id=MWG-Gesamtpersonenverzeichnis_2019-09-25.text.div.1.div.2.p.75 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Corelli , Arcangelo (I/14) (17.2.1653–8.1.1713). Komponist, Violinvirtuose. Wurde 17jährig in die Academia filarmonica in Bo logna aufgenommen, 1687 „Maestro di Musica“ des Kardinals Benedetto Panfili in Rom. 1700 von Kardinal Pietro Ottoboni, dem Neffen des Papstes Alexander VIII., zum Haupt der Instrumentisten der „Academia di Santa Cecilia“ (somit zum ersten Instrumental-Komponisten Roms) ernannt. Er liegt im Pantheon links neben Raffael begraben. Die Zeitgenossen verehrten ihn als „Princeps musicorum“
, „Maestro dei Maestri“ und „Virtuosissimo di Violino e vero Orfeo di nostri tempi“. Kompositionsgeschichtlich bedeutend sind seine Concerti grossi, Trio- und Violinsonaten.
****> BUT FOUND (strict text after normalization) ****>
Corelli , Arcangelo (I/14) (17.2.1653–8.1.1713). Komponist, Violinvirtuose. Wurde 17jährig in die Academia filarmonica in Bologna aufgenommen, 1687 „Maestro di Musica“ des Kardinals Benedetto Panfili in Rom. 1700 von Kardinal Pietro Ottoboni, dem Neffen des Papstes Alexander VIII., zum Haupt der Instrumentisten der „Academia di Santa Cecilia“ (somit zum ersten Instrumental-Komponisten Roms) ernannt. Er liegt im Pantheon links neben Raffael begraben. Die Zeitgenossen verehrten ihn als „Princeps musicorum“
, „Maestro dei Maestri“ und „Virtuosissimo di Violino e vero Orfeo di nostri tempi“. Kompositionsgeschichtlich bedeutend sind seine Concerti grossi, Trio- und Violinsonaten.
******* DEVIATION POINT: nica in Bo<*HERE*>logna auf
(also checked against older rules prior to FoLiA v2.4.1)
@kosloot So it rejects the ucto output while the input is valid, which makes it an ucto issue. I see you have even identified the problem already.
Btw, can one simply call python-ucto on a folia.Paragraph and access sentences too, next to tokens, with foliapy?
Yes, the wrapper should support folia input and output. (though the input should be a full document I think)
from ucto.
Addition: Opening this file in Emacs shows a hyphen (-) symbol. more and vi will display a space. less will show <U+00AD>
So even for these tools there is no consensus.
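Since the viewers disagree, it may help to look for the character programmatically instead of visually. A minimal Python sketch (the helper name `find_soft_hyphens` and the sample string are made up for illustration, they are not part of any FoLiA tool):

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def find_soft_hyphens(text, context=10):
    """Return (offset, snippet) for every soft hyphen found in text."""
    hits = []
    for offset, char in enumerate(text):
        if char == SOFT_HYPHEN:
            start = max(0, offset - context)
            before = text[start:offset]
            after = text[offset + 1:offset + 1 + context]
            # Make the invisible character explicit in the snippet.
            hits.append((offset, before + "<U+00AD>" + after))
    return hits

# Illustrative sample, modelled on the paragraph from the validator output.
sample = "filarmonica in Bo\u00adlogna aufgenommen"

# UTF-8 bytes of the soft hyphen, as upper-case hex.
utf8_hex = " ".join(f"{byte:02X}" for byte in SOFT_HYPHEN.encode("utf-8"))
print(utf8_hex)
# Unicode general category of the soft hyphen.
print(unicodedata.category(SOFT_HYPHEN))
print(find_soft_hyphens(sample))
```

This reports the position independently of how a terminal or editor chooses to render the character.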
Maybe the most elegant way is to have rst2folia introduce a <t-hbr class="soft"/> here?
But still some other tool might slip in these symbols, so our libraries should somehow handle them.
@proycon in libfolia you added a parameter to the normalize_spaces() function to replace all Control Characters by a single space. The Soft Hyphen IS a Control Character. Hence the normalization to a space.
apparently FoliaPy doesn't do this?
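To illustrate the effect described above, here is a rough Python sketch of a normalizer that maps "control" characters to a space. It is a hypothetical stand-in, not libfolia's code: the function body and the `control_categories` parameter are assumptions, and treating the Unicode categories Cc and Cf together as "control" is also an assumption (Unicode itself puts U+00AD in Cf, the format category).

```python
import unicodedata

def normalize_spaces(text, control_categories=("Cc", "Cf")):
    """Map whitespace and 'control' characters to a space, then collapse runs.

    Hypothetical stand-in for illustration; not libfolia's implementation.
    """
    mapped = [
        " " if ch.isspace() or unicodedata.category(ch) in control_categories else ch
        for ch in text
    ]
    return " ".join("".join(mapped).split())

text = "Academia filarmonica in Bo\u00adlogna"

# Treating Cf as control: the soft hyphen turns into a space.
print(repr(normalize_spaces(text)))
# Treating only Cc as control: the soft hyphen is left in place.
print(repr(normalize_spaces(text, control_categories=("Cc",))))
```

Two implementations that differ only in this one choice would produce different normalized text for the same paragraph, which is the kind of disagreement seen between the tools here.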
So one conclusion is already that the [FILTER] rules play no role here. Which is a relief.
hmm, this is "interesting".
@proycon foliavalidator rejects both files, so the error is not introduced by ucto.
folialint accepts both files
The offending "space" is the byte sequence C2 AD, i.e. the soft hyphen (U+00AD) in UTF-8. Ucto silently discards those in the tokens, but NOT in the original text of the paragraph, which leads to this trouble. Defining an outputclass in ucto might solve this, but that is not ideal either (it cannot be 'current' then).
We should think about the best strategy here for ucto.
In general it might be better to discard soft hyphens as soon as possible, also in piereling or rst2folia?
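Discarding them early could be as small as a text filter on the intermediate text, before any FoLiA is generated. A minimal sketch in Python, assuming such a hook exists somewhere in the conversion chain (the function `strip_soft_hyphens` is hypothetical and not part of piereling or rst2folia):

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def strip_soft_hyphens(text):
    """Return (cleaned_text, number_of_soft_hyphens_removed)."""
    count = text.count(SOFT_HYPHEN)
    return text.replace(SOFT_HYPHEN, ""), count

cleaned, removed = strip_soft_hyphens("des Hammer\u00adklaviers")
print(cleaned, removed)
```

Returning the count makes it possible to log how many characters were dropped, so the change to the source text does not go unnoticed.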
@pirolen I don't see a Linebreak problem in these files...?
@pirolen I don't see a Linebreak problem in these files...?
I think ucto generated a declaration like this:
<sentence-annotation>
<annotator processor="ucto.1"/>
</sentence-annotation>
<linebreak-annotation/>
about which I got an error message (either from foliavalidator, or when using foliapy -- cannot reproduce right now).
Btw, can one simply call python-ucto on a folia.Paragraph and access sentences too, next to tokens, with foliapy?
That's not what I reproduced here:
Indeed, I stand corrected.
So it boils down to determining what to do with the Soft Hyphen. ucto discards those, which always seemed a good plan. But do we want to remove them from the original <t> too? (And also modify that text with the other replacements from the [FILTER] rules in tokconfig-*?)
Alternatively we could choose to
- NOT apply the [FILTER] rules (except when an alternative outputclass is specified)
- OR: adapt foliavalidator to accept this, like folialint presumably does: just normalize a soft hyphen to a space for text comparison
- OR: have all folia implementations (normalizers?) completely remove the soft hyphen
NOTE: libfolia maps a lot of 'space-like' characters to a normal space, so in general Ucto will NOT receive the Soft Hyphen at all. But I have to check this.
Still, the best solution might be to remove them at an earlier stage, as they are a big pita.
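The last two alternatives can be compared with a small Python sketch. It assumes the situation described in this thread: the paragraph text still contains the soft hyphen while the token text does not. All names (`as_space`, `removed`, `deviation_point`) and both sample strings are made up for illustration; this is not the comparison code of foliavalidator or folialint.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def collapse(text):
    """Collapse runs of whitespace to single spaces."""
    return " ".join(text.split())

def as_space(text):
    """Normalize by turning every soft hyphen into a space."""
    return collapse(text.replace(SOFT_HYPHEN, " "))

def removed(text):
    """Normalize by deleting every soft hyphen."""
    return collapse(text.replace(SOFT_HYPHEN, ""))

def deviation_point(expected, found):
    """Index of the first differing character, or None if both are equal."""
    for index, (a, b) in enumerate(zip(expected, found)):
        if a != b:
            return index
    if len(expected) != len(found):
        return min(len(expected), len(found))
    return None

paragraph_text = "in Bo\u00adlogna aufgenommen"  # soft hyphen kept in the paragraph text
token_text = "in Bologna aufgenommen"            # soft hyphen dropped in the tokens

for name, normalize in (("as space", as_space), ("removed", removed)):
    print(name, deviation_point(normalize(paragraph_text), normalize(token_text)))
```

Under these assumptions, mapping the soft hyphen to a space only helps when both texts still contain it; once one side has dropped it, only removal on both sides makes the texts compare equal.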
See also: is it a space or?
So I propose the following solution:
- First and foremost, our tools should NOT create FoLiA with embedded Soft-Hyphens, as that leads to all kinds of confusion.
- When a Soft-Hyphen unfortunately does appear in FoLiA, we in general leave it untouched. So NO replacement or deletion.
  A consequence is that this FoLiA will look different in different tools (Emacs, less, vi, more, maybe FLAT?).
- FoLiA tools might try to resolve Soft-Hyphens however they like, but must be very careful to maintain text consistency.
This means that as far as I can see, libfolia needs a small change to exempt Soft-Hyphen from normalize_spaces().
foliavalidator seems to be ok. But I strongly suggest adaptations to rst2folia and such to avoid Soft-Hyphens.
As a consequence of this libfolia adaptation, tools like ucto and frog will create output which seemingly contains spaces (as some tools will show a space for a Soft-Hyphen).
I can live with that.
Example, for:
<p xml:id="MWG.p.2">
<t>des Hammer klaviers</t>
</p>
with a Soft-Hyphen between 'Hammer' and 'klaviers',
ucto will create:
<p xml:id="MWG.p.2">
<t>des Hammerklaviers</t>
<s xml:id="MWG.p.2.s.1">
<w xml:id="MWG.p.2.s.1.w.1" class="WORD">
<t>des</t>
</w>
<w xml:id="MWG.p.2.s.1.w.2" class="WORD">
<t>Hammer klaviers</t>
</w>
</s>
</p>
and Frog will produce:
1 des de [des] LID(bep,gen,evmo) 0.400000
2 Hammer klaviers hammer klaviers [hammer klavier][s] N(soort,mv,basis) 0.825312
So 'Hammer klaviers' is just ONE word (which seems right, in fact).
As tools like Mbt, MBMA, MBLEM and such are NOT trained with data containing Soft-Hyphens, it is very well possible that processing of those words is not optimal. Therefore avoiding them is still the best way. If this becomes a really big issue, we could still decide to adapt Frog to remove them.