Comments (4)
Here are the first results of my investigation.
- The failure of the SVMTagger leads to a LinguisticProcessingException.
- In Release mode, this exception is supposed to be caught by the upper elements of the call stack, but it is not.
- I found that the compile option `WITH_DEBUG_MESSAGES` does not behave as expected.
- The macro flag DEBUG_LP, which controls the catching of exceptions, is erroneously defined in Release mode.
- I propose a correction in `SetCompilerFlags.cmake` which defines `WITH_DEBUG_MESSAGES` as a CMake option.
- With this correction, the paragraphs (engText) responsible for the crashes in my input XML file are aborted and correctly closed in the `.mult` output file: no content, but some properties are reported by readMultFile for these nodes.
- The following paragraphs (engText) in my input XML file are correctly processed up to the last one, as seen in the `.mult` file.
- The document is correctly closed in the `.mult` file.
Here is a sample XML file causing SVMTagger to crash: 02552_GS_RC_MEC_682_EN_00.xml
Sample error log after my correction in `SetCompilerFlags.cmake`:
user:home$ analyzeXml -l eng -p TechnipTenderXML 02552_GS_RC_MEC_682_EN_00.xml
: LP::PosTagger : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in ' '
: LP::CoreClient : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:39.587 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 6149
: LP::PosTagger : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in ' '
: LP::CoreClient : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 10389
: LP::PosTagger : 2021-12-09T15:26:41.809 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 52927
: LP::PosTagger : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' .5 : NOUN ' from SVMTagger and ' "\n.5" ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 55901
: LP::PosTagger : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.970 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 56480
Total: 5317 ms
02552_GS_RC_MEC_682_EN_00.xml.zip
Error recovery now works in Release mode thanks to the fix to `WITH_DEBUG_MESSAGES` in commit e8e2e11.
This makes it possible to process large XML files in which each page is a node (engText), with minimal impact on the final result.
The SVMTagger crash itself is still not solved.
Solved in commit 876c293:
gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 No matching category found for tagger result ".0" "NOUN"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = .0
1 \x0a.0
.0 NUM _ _ _ _ _ NE=I-Numex.NUMBER|Pos=1|Len=3
gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 No matching category found for tagger result ".\u200B." "NOUN"
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = some text.
1 some some DET _ _ 2 det _ Pos=1|Len=4
2 text text NOUN _ NUMBER=SING 3 Dummy _ Pos=6|Len=4|SpaceAfter=No
3 .\x0a. .
. SENT _ _ 0 _ _ Pos=10|Len=3
But it does not solve the underlying tokenizer error.
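To illustrate why a token containing a newline can produce the "did not get 2 elements" and alignment errors above, here is a minimal Python sketch. It assumes the tagger exchanges tokens through a line-oriented format (one token per line), which is my reading of the logs and not something confirmed in this thread. The names `serialize` and `sanitize` are hypothetical and are not LIMA functions. The second log shows `.\u200B.` coming back from the tagger, which suggests a substitution of this kind, but the actual change in commit 876c293 is not reproduced here.

```python
ZERO_WIDTH_SPACE = "\u200b"  # assumption: stand-in character for embedded newlines


def serialize(tokens):
    """Write one token per line, as a line-oriented tagger input would expect."""
    return "\n".join(tokens)


def sanitize(token):
    """Replace embedded newlines so the token occupies exactly one line."""
    return token.replace("\n", ZERO_WIDTH_SPACE)


tokens = ["some", "text", ".\n."]

# Without sanitizing, the 3 tokens become 4 lines: tagger output and graph no longer align.
raw_lines = serialize(tokens).split("\n")

# With sanitizing, there is again exactly one line per token.
safe_lines = serialize([sanitize(t) for t in tokens]).split("\n")

print(len(tokens), len(raw_lines), len(safe_lines))
```

This only hides the symptom on the tagger side; a tokenizer that never emits tokens with embedded newlines would make the substitution unnecessary.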
Dear @kleag,
I have a new example that crashes the SVMPosTagger. The offending characters are a sequence of three dots: "...".
I worked around the issue by replacing them in the analyzed text with the Unicode character U+2026 followed by two spaces: "… ".
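The workaround can be applied as a pre-processing step before the text is passed to the analyzer. The following Python sketch is only an illustration: `replace_triple_dots` is a hypothetical helper, not part of LIMA. The replacement has the same length as the original three dots, so character offsets stay unchanged; that this is the reason for the two spaces is my assumption.

```python
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS


def replace_triple_dots(text):
    """Replace each run of three dots with U+2026 plus two spaces (same length)."""
    return text.replace("...", ELLIPSIS + "  ")


sample = "wait... ok"
cleaned = replace_triple_dots(sample)

# The text length is preserved, so offsets reported by the analyzer still match the input.
print(len(sample) == len(cleaned))
```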