
Comments (4)

benlabbe commented on July 22, 2024

Here are the first elements of my investigations.

  • The failure of the SVMTagger leads to a LinguisticProcessingException.
  • In Release mode, this exception is supposed to be caught by the upper levels of the call stack, but it is not.
  • I found that the compile option WITH_DEBUG_MESSAGES does not behave as expected.
    • The macro flag DEBUG_LP, which controls whether exceptions are caught, is erroneously defined in Release mode.
  • I propose a correction in SetCompilerFlags.cmake which defines WITH_DEBUG_MESSAGES as a CMake option.
  • With this correction, the paragraphs (engText) responsible for the crashes in my input XML file are aborted and correctly closed in the .mult output file: they have no content, but readMultFile reports some properties for these nodes.
  • The paragraphs (engText) that follow in my input XML file are correctly processed up to the last one, as seen in the .mult file.
  • The document is correctly closed in the .mult file.
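
The recovery behaviour described in the list above can be illustrated with a minimal Python sketch. It is not LIMA's C++ code: `LinguisticProcessingError`, `analyze_nodes` and the `analyze` callback are hypothetical names, used only to show the idea of aborting one node and carrying on with the next.

```python
class LinguisticProcessingError(Exception):
    """Hypothetical stand-in for LIMA's LinguisticProcessingException."""


def analyze_nodes(nodes, analyze):
    """Analyze (node_id, text) pairs, isolating failures per node.

    `analyze` is any callable that may raise LinguisticProcessingError.
    A failing node is still recorded, with None as its content, so the
    output keeps one entry per input node and later nodes are processed.
    """
    results = []
    for node_id, text in nodes:
        try:
            results.append((node_id, analyze(text)))
        except LinguisticProcessingError:
            # Abort this node only: keep the node, drop its content.
            results.append((node_id, None))
    return results
```

With this shape, a failure in one engText paragraph costs only that paragraph, which matches what is observed in the .mult file after the correction.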

Here is a sample XML file that causes SVMTagger to crash: 02552_GS_RC_MEC_682_EN_00.xml
Sample error log after my correction in SetCompilerFlags.cmake:

user:home$ analyzeXml -l eng -p TechnipTenderXML 02552_GS_RC_MEC_682_EN_00.xml
 : LP::PosTagger : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.587 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 6149 
 : LP::PosTagger : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 10389 
 : LP::PosTagger : 2021-12-09T15:26:41.809 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 52927 
 : LP::PosTagger : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' .5 : NOUN ' from SVMTagger and ' "\n.5" ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 55901 
 : LP::PosTagger : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.970 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 56480 
Total: 5317 ms

02552_GS_RC_MEC_682_EN_00.xml.zip

from lima.

benlabbe commented on July 22, 2024

Error recovery now works in Release mode thanks to the fix on WITH_DEBUG_MESSAGES in commit e8e2e11.
This makes it possible to process large XML files where each page is a node (engText), with minimal impact on the final result.

The SVMTagger crash is still not solved.


kleag commented on July 22, 2024

Solved in commit 876c293:

gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 No matching category found for tagger result  ".0"   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text =  .0 
1       \x0a.0
.0      NUM     _       _       _       _       _       NE=I-Numex.NUMBER|Pos=1|Len=3

gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 No matching category found for tagger result  ".\u200B."   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text = some text.
1       some    some    DET     _       _       2       det     _       Pos=1|Len=4
2       text    text    NOUN    _       NUMBER=SING     3       Dummy   _       Pos=6|Len=4|SpaceAfter=No
3       .\x0a.  .
.       SENT    _       _       0       _       _       Pos=10|Len=3

But it does not solve the underlying tokenizer error.
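
The warnings above suggest that the tagger now detects tokens containing newlines and works around them. Here is a minimal Python sketch of that kind of guard. It is not the code from the commit: `sanitize_token` is a hypothetical helper, and replacing the newline with a zero-width space is only an assumption drawn from the `\u200B` visible in the second log.

```python
from typing import Tuple

ZERO_WIDTH_SPACE = "\u200b"


def sanitize_token(token: str) -> Tuple[str, bool]:
    """Return (cleaned_token, was_invalid).

    A token is treated as invalid when it contains a line break. Each
    line-break character is replaced by a zero-width space so that the
    token stays on one line and keeps the same number of characters.
    """
    if "\n" not in token and "\r" not in token:
        return token, False
    cleaned = token.replace("\r", ZERO_WIDTH_SPACE).replace("\n", ZERO_WIDTH_SPACE)
    return cleaned, True
```

A guard like this only hides the symptom at the tagger input; the tokenizer still produces the invalid token.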


benlabbe commented on July 22, 2024

Dear @kleag ,

I have a new example that crashes the SVMPosTagger. The offending characters are a sequence of three dots: "...".
I worked around the issue by replacing them in the analyzed text with the Unicode character U+2026 (horizontal ellipsis) followed by two spaces: "…  ".
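
As an illustration, the replacement can be done in a small preprocessing step before the text is handed to the analyzer. This is a minimal Python sketch, and `replace_three_dots` is a hypothetical helper name. The replacement has the same number of characters as the three dots, so character positions in the text are unchanged; whether that is the reason for the two spaces is an assumption.

```python
# U+2026 (horizontal ellipsis) followed by two spaces: three characters,
# the same length as the "..." it replaces.
ELLIPSIS_REPLACEMENT = "\u2026  "


def replace_three_dots(text: str) -> str:
    """Replace every "..." with U+2026 plus two spaces."""
    return text.replace("...", ELLIPSIS_REPLACEMENT)
```

Note that the length is preserved in characters (code points), not in UTF-8 bytes, since U+2026 is encoded on three bytes.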

