aymara / lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
Home Page: http://aymara.github.io/lima/
License: Other
analyzeText exits with a LinguisticProcessingException on some unusual (but valid) UTF-8 characters, with the error message:
ERROR HypenWordAlternatives : no token forward !
Deverbal nouns can be long and complex, for example chidí naaʼnaʼí beeʼeld htsoh bikááʼ dah naaznilígíí "char d'assaut" ("tank"), formed of three main elements:
Example:
analyzing "La production de lait en Ukraine a augmenté dans les 19 régions du pays" yields a length of 10 for "a augmenté", whereas it should be 15 when the entity é is taken into account.
Proposed solution (master branch):
void BoWBinaryWriterPrivate::writeSimpleToken(std::ostream& file,
                                              const boost::shared_ptr<BoWToken> token) const
{
#ifdef DEBUG_LP
  BOWLOGINIT;
  LDEBUG << "BoWBinaryWriter::writeSimpleToken write lemma: " << &file << token->getLemma();
#endif
  Misc::writeUTF8StringField(file, token->getLemma());
#ifdef DEBUG_LP
  LDEBUG << "BoWBinaryWriter::writeSimpleToken write infl: " << token->getInflectedForm();
#endif
  Misc::writeUTF8StringField(file, token->getInflectedForm());
  Misc::writeCodedInt(file, token->getCategory());
  //////////////// CORRECTION /////////////////////
  // fix the length, which did not take XML entities in the lemma into account
  auto beg = token->getPosition();
  auto end = token->getLength() + beg;
  if (m_shiftFrom.empty())
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom is empty";
#endif
  }
  else
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin" << beg;
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end" << end;
#endif
    auto const it1 = m_shiftFrom.lowerBound(beg-1);
    if (it1 == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin: NO shift";
#endif
    }
    else
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin: shift by" << (it1-1).value();
#endif
      beg += (it1-1).value();
    }
    auto const it2 = m_shiftFrom.lowerBound(end-1);
    if (it2 == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end: NO shift";
#endif
    }
    else
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end: shift by" << (it2-1).value();
#endif
      end += (it2-1).value();
    }
  }
  Misc::writeCodedInt(file, beg-1);
  Misc::writeCodedInt(file, end-beg);
  ///////////////////////// END CORRECTION ///////////////////////////////
  /* Replaced code:
  if (m_shiftFrom.empty())
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom is empty";
#endif
    Misc::writeCodedInt(file,token->getPosition()-1);
  }
  else
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from" << token->getPosition();
#endif
    QMap<uint64_t,uint64_t>::const_iterator it = m_shiftFrom.lowerBound(token->getPosition()-1);
    if (it == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom NO shift";
#endif
      Misc::writeCodedInt(file,token->getPosition()-1);
    }
    else
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom shift by" << (it-1).value();
#endif
      Misc::writeCodedInt(file,token->getPosition()+ (it-1).value()-1);
    }
  }
  Misc::writeCodedInt(file,token->getLength());
  */
}
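The shift logic of the fix can be illustrated in isolation. Below is a minimal sketch (hypothetical names, with std::map standing in for QMap and illustrative offsets) of how a shift map built from XML-entity expansions corrects both ends of a token, reproducing the 10-to-15 length example above:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical stand-in for m_shiftFrom: maps an offset in the unescaped
// text to the cumulative shift introduced by XML entities before it.
using ShiftMap = std::map<uint64_t, uint64_t>;

// Shift a single (1-based) offset: find the last map entry strictly before
// `pos` and add its cumulative shift, mirroring the lowerBound logic above.
uint64_t shifted(const ShiftMap& shiftFrom, uint64_t pos)
{
  auto it = shiftFrom.lower_bound(pos - 1);
  if (it == shiftFrom.begin()) return pos;  // no entity before pos: no shift
  return pos + std::prev(it)->second;       // apply the cumulative shift
}

// Corrected length: shift begin and end independently, then subtract, so an
// entity inside the token (e.g. "&#233;" for "é", shift 5) counts fully.
uint64_t correctedLength(const ShiftMap& shiftFrom, uint64_t beg, uint64_t len)
{
  return shifted(shiftFrom, beg + len) - shifted(shiftFrom, beg);
}
```

With a single entity adding a shift of 5 inside a 10-character token, the corrected length is 15, as in the "a augmenté" example.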
The FeatureStoredData::getValue implementation seems wrong: it returns the token position instead of the stored feature value.
Furthermore, the variable annot is set but never used.
After analyzing "C'est un test.", the analysis output drops "C'est":
3 un un DET DET _ _ 4 DETSUB _ _
4 test test NC NC _ _ _ _ _ _
5 . . PONCTU PONCTU_FORTE _ _ _ _ _ _
Before the syntactic analysis, the analysis graph and the PoS graph are correct (see attached files test-1.txt.bp.dot.png, test-1.txt.dot.png).
After the syntactic analysis, the morphosyntactic data of the "C'" and "est" nodes are corrupted (see attached file test-1.txt.afterSA.dot.png).
The process described in issue #3 works for analyzeText, which is local and is at the same time the server and the client. For a network client-server setup, it is still
necessary to hard-code the handlers' initialization. To avoid that, the analysis
client API would need to be enriched with access to this data.
One should be able to define and use a head token in sub-automatons.
Currently, you define and use a subautomaton as follows:
define subautomaton NounGroup {
pattern=$DET? ($ADV{0-2} $ADJ){0-2} ($NC){0-2} $NC
}
@InfinitiveVerb::%NounGroup:SYNTACTIC_RELATION:
+!GovernorOf(right.1.4,"ANY")
+GovernedBy(trigger.1,"PrepInf")
+CreateRelationBetween(right.1.4,trigger.1,"COD_V")
=>AddRelationInGraph()
=<ClearStoredRelations()
Thus you have to know the structure of the subautomaton to use it. Instead, you should be able to define a head token in the subautomaton and refer to it by name. The above example would become:
define subautomaton NounGroup {
pattern=$DET? ($ADV{0-2} $ADJ){0-2} ($NC){0-2} $NC
head=4
}
@InfinitiveVerb::%NounGroup:SYNTACTIC_RELATION:
+!GovernorOf(right.head,"ANY")
+GovernedBy(trigger.1,"PrepInf")
+CreateRelationBetween(right.head,trigger.1,"COD_V")
=>AddRelationInGraph()
=<ClearStoredRelations()
analyzeText -l eng test.txt
or
analyzeText -l fre test.txt
with test.txt, a file containing only a dot (with or without a newline), causes a segmentation fault.
A detailed link error should be posted.
Sometimes, particularly with generated rules, it would be useful to allow case-insensitive matching in Modex rules. For example, if a resource lists entities whose first token is capitalized, rule generation lowercases all tokens, even those that are actually capitalized in the text.
For example, if two entities are "T cell" and "Anatomic pathology procedure" then generated rules will be
t:::cell:X:
anatomic::pathology procedure:Y:
But the lemma of "T" in texts will remain "T", so the first rule will not match.
If we could specify that the matching should be case insensitive, this problem would be solved.
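As a sketch of what case-insensitive token matching could look like, the helper below compares an ASCII rule token with a text token; this is a hypothetical illustration, and real Modex rules would need Unicode case folding (e.g. QString::toCaseFolded or ICU) instead of tolower:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch: ASCII-only case-insensitive equality between a
// rule token and a token from the analyzed text.
bool matchesIgnoreCase(const std::string& ruleToken, const std::string& textToken)
{
  if (ruleToken.size() != textToken.size()) return false;
  return std::equal(ruleToken.begin(), ruleToken.end(), textToken.begin(),
                    [](unsigned char a, unsigned char b) {
                      return std::tolower(a) == std::tolower(b);
                    });
}
```

With such a comparison, the generated rule token "t" would match the text lemma "T".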
Currently, a constraint returns true iff:
This means that a constraint will return false when referring to an absent optional element. For example, the following rule will not match the input "a c":
a:b? c::TYPE:
+Constraint(left.1,"value")
This behavior should be changed: if we explicitly write that an element is optional, then we probably want constraints to allow its absence too, even though the constraint must still be verified when the element is present.
When the change is made, the documentation (md file) will have to be updated.
Minus (hyphen) characters and digits of unknown words are deleted from normalized forms because the unmark and minus values of these characters are empty.
We can fix this by modifying the tokenizerAutomaton-lang.chars.tok file, adding the unmark definition to each character we want to keep in the normalized form.
For example, in French the following line
0030, DIGIT ZERO, c_5;
can be modified to
0030, DIGIT ZERO, c_5, u0030;
Without this definition, such characters are deleted.
The ANC MASC corpus is much larger than the NLTK WSJ subset that we currently use, and it is genuinely free, which makes it easier to distribute.
We should switch to it. This mainly means adapting it to LIMA tokenization (idioms and entities handled before learning the PoS tagging model).
In EventTemplateDefinitionResource::getStructure, line 61, a reference to a stack-allocated local variable is returned. This is wrong.
When deployed, library names contain 'SOVERSION' instead of an actual number
Currently, when a configuration exception occurs (missing module, group, parameter…), it is hard to know which file should have contained the missing information.
We should add a way to access this information. Note that if several files are merged into one configuration, several files could contain the same information, so a kind of stack or list of files will have to be handled.
Also, the elements will have to be linked to their parents. What should be done for inclusions?
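One possible shape for this, sketched here with hypothetical names: a configuration exception type that carries the list of files merged into the configuration, so the error message can report every file that could have held the missing element:

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: a configuration exception recording the stack of
// files merged into the configuration, since any of them could have
// declared the missing module/group/parameter.
class NoSuchParamException : public std::runtime_error
{
public:
  NoSuchParamException(const std::string& param,
                       std::vector<std::string> files)
    : std::runtime_error(buildMessage(param, files)),
      m_files(std::move(files)) {}

  // The list of candidate configuration files, for programmatic access.
  const std::vector<std::string>& files() const { return m_files; }

private:
  static std::string buildMessage(const std::string& param,
                                  const std::vector<std::string>& files)
  {
    std::ostringstream oss;
    oss << "missing parameter '" << param << "' (searched in:";
    for (const auto& f : files) oss << ' ' << f;
    oss << ")";
    return oss.str();
  }

  std::vector<std::string> m_files;
};
```

The file names here are illustrative only; the real implementation would collect them while parsing and merging the configuration files.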
I just cloned the project from GitHub, installed the prerequisites including the nltk data, set all variables and executed ./gbuild.sh, and the build aborts (all details below).
I tried to launch "$LIMA_DIST/bin/analyzeText -l eng ~/jva.txt" but I got this error message:
: Common::PropertyCode : ERROR 2015-04-20T14:45:31.716 0x11d3cb0 invalid XMLPropertyCode file /home/jean-louis/lima-dist/share/apps/lima/resources/LinguisticProcessings/eng/code-eng.xml
: Common::LanguageData : ERROR 2015-04-20T14:45:31.716 0x11d3cb0 Error while reading PropertyFile file:
terminate called after throwing an instance of 'Lima::InvalidConfiguration'
what():
Aborted (core dumped)
====================================================
DISTRIB (64b) :
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"
NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
UNAME:
Linux ubuntu14-lima 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
==[Variables]==========================================
export JVA=/home/jean-louis
export Qt5_DIR=/opt/qt53
export LIMA_ROOT=$JVA/lima/aymara/lima
LIMA_SOURCES=$LIMA_ROOT/lima/aymara/lima
export LIMA_BUILD_DIR=$LIMA_SOURCES/build
export NLTK_PTB_DP_FILE=$JVA/nltk_data/corpora/dependency_treebank/nltk-ptb.dp
export LINGUISTIC_DATA_ROOT=$LIMA_SOURCES/lima_linguisticData
export LIMA_DIST=$JVA/lima-dist
export LIMA_CONF=$LIMA_DIST/share/config/lima
export LIMA_RESOURCES=$LIMA_DIST/share/apps/lima/resources
export LIMA_EXTERNALS=$LIMA_ROOT/externals
export PATH=$LIMA_DIST/bin:$LIMA_DIST/share/apps/lima/scripts:$PATH
export LD_LIBRARY_PATH=$LIMA_EXTERNALS/lib:$LIMA_DIST/lib:/opt/qt53/lib
===[Compilation end]================================================
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.idiom.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.sa.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.se.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.se-PERSON.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.simpleword.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.tokenizer.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.multilevel.xml
[ 52%] Performing test step for 'lima_linguisticprocessing'
make[3]: warning: jobserver unavailable: using -j1. Add `+' to parent make rule.
Running tests...
Test project /home/jean-louis/lima/aymara/lima/lima/aymara/lima/build/master/debug/lima/lima_linguisticprocessing-prefix/src/lima_linguisticprocessing-build
Start 1: BagOfWordsTest0
1/6 Test #1: BagOfWordsTest0 .................. Passed 0.01 sec
Start 2: BagOfWordsTest1
2/6 Test #2: BagOfWordsTest1 .................. Passed 0.02 sec
Start 3: BagOfWordsTest2
3/6 Test #3: BagOfWordsTest2 .................. Passed 0.03 sec
Start 4: AnnotationGraphTest0
4/6 Test #4: AnnotationGraphTest0 ............. Passed 0.03 sec
Start 5: CharChartTest0
5/6 Test #5: CharChartTest0 ................... Passed 0.06 sec
Start 6: CharChartTestAra
6/6 Test #6: CharChartTestAra ................. Passed 0.02 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 0.17 sec
[ 54%] Completed 'lima_linguisticprocessing'
[ 54%] Built target lima_linguisticprocessing
make: *** [all] Error 2
(comment updated to avoid wrong links to other issues)
When I analyze the text "12,8" in French, I get the following result. The "," character disappears from the lemma.
<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
<node elementName="MEMO">
<node elementName="fre" indexingNode="yes">
<content type="tokens">
<tokens>
<bowNamedEntity id="1" lemma="128" category="8192" categoryString="NC" position="84" length="4" type="Numex.NUMBER">
<parts head="0">
<bowToken id="2" lemma="128" category="8192" categoryString="NC" position="84" length="4"/>
</parts>
<feature name="numvalue" value="12.8"/>
<feature name="value" value="12,8"/>
</bowNamedEntity>
</tokens>
<properties>
<property name="ContentId" type="int" value="1"/>
<property name="type" type="string" value="tokens"/>
</properties>
</content>
<properties>
<property name="ContentId" type="int" value="1"/>
<property name="NodeId" type="int" value="2"/>
<property name="StructureId" type="int" value="1"/>
<property name="offBegPrpty" type="int" value="84"/>
<property name="offEndPrpty" type="int" value="88"/>
<property name="encodPrpty" type="string" value="UTF8"/>
<property name="langPrpty" type="string" value="fre"/>
<property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
<property name="indexDatePrpty" type="date" value="20170420"/>
</properties>
</node>
<properties>
<property name="ContentId" type="int" value="0"/>
<property name="NodeId" type="int" value="1"/>
<property name="StructureId" type="int" value="1"/>
<property name="offBegPrpty" type="int" value="60"/>
<property name="offEndPrpty" type="int" value="94"/>
<property name="encodPrpty" type="string" value="UTF8"/>
<property name="identPrpty" type="string" value="3947"/>
<property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
<property name="indexDatePrpty" type="date" value="20170420"/>
</properties>
</node>
</MultimediaDocuments>
The problem comes from the CharChart class. I propose a fix in the following two methods:
LimaString CharChart::unmarkByString (const LimaChar& c) const
{
  ...
  // addition
  if (result.isEmpty())
    result.push_back(c);
  // end addition
#ifdef DEBUG_LP
  LDEBUG << "CharChart::unmarkByString" << result;
#endif
  return result;
}
LimaString CharChart::unmark(const LimaString& str) const
{
  ...
  // silently discard invalid character
  catch (InvalidCharException) {} <----- LINE TO REMOVE
  catch (InvalidCharException) { desaccented.push_back(str.at(i)); } <----- LINE TO ADD
  }
  return desaccented;
}
Which gives:
<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
<node elementName="MEMO">
<node elementName="fre" indexingNode="yes">
<content type="tokens">
<tokens>
<bowNamedEntity id="1" lemma="12,8" category="8192" categoryString="NC" position="84" length="4" type="Numex.NUMBER">
<parts head="0">
<bowToken id="2" lemma="12,8" category="8192" categoryString="NC" position="84" length="4"/>
</parts>
<feature name="numvalue" value="12.8"/>
<feature name="value" value="12,8"/>
</bowNamedEntity>
</tokens>
<properties>
<property name="ContentId" type="int" value="1"/>
<property name="type" type="string" value="tokens"/>
</properties>
</content>
<properties>
<property name="ContentId" type="int" value="1"/>
<property name="NodeId" type="int" value="2"/>
<property name="StructureId" type="int" value="1"/>
<property name="offBegPrpty" type="int" value="84"/>
<property name="offEndPrpty" type="int" value="88"/>
<property name="encodPrpty" type="string" value="UTF8"/>
<property name="langPrpty" type="string" value="fre"/>
<property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
<property name="indexDatePrpty" type="date" value="20170420"/>
</properties>
</node>
<properties>
<property name="ContentId" type="int" value="0"/>
<property name="NodeId" type="int" value="1"/>
<property name="StructureId" type="int" value="1"/>
<property name="offBegPrpty" type="int" value="60"/>
<property name="offEndPrpty" type="int" value="94"/>
<property name="encodPrpty" type="string" value="UTF8"/>
<property name="identPrpty" type="string" value="3947"/>
<property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
<property name="indexDatePrpty" type="date" value="20170420"/>
</properties>
</node>
</MultimediaDocuments>
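The intent of both corrections is the same: when no unmarked form is known for a character, keep the character instead of dropping it. A standalone sketch of that fallback, using a hypothetical table and std::u32string in place of LimaString:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical unmark table: maps a character to its unmarked form
// (e.g. 'é' -> 'e'). Characters absent from the table have no known
// unmarked form.
using UnmarkTable = std::map<char32_t, char32_t>;

// Remove diacritics; unlike the buggy version, characters with no entry
// in the table are kept as-is instead of being silently discarded.
std::u32string unmark(const UnmarkTable& table, const std::u32string& str)
{
  std::u32string result;
  for (char32_t c : str)
  {
    auto it = table.find(c);
    result.push_back(it != table.end() ? it->second : c);  // fallback: keep c
  }
  return result;
}
```

With the fallback, "12,8" keeps its comma instead of being normalized to "128".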
Replace the management of dumpers and handlers in analyzeText: given a
pipeline, it is possible to check the language configuration file to retrieve
the active dumpers and then the handlers they need (name and class id). One can
then instantiate the handlers and give them to the client.
Building the Aymara code currently produces 2 packages: lima_common and lima_linguisticprocessing.
From the same code, we need to build the 4 following packages: lima_common, lima_common-dev, lima_linguisticprocessing and lima_linguisticprocessing-dev.
lima_common would contain the library and binaries.
lima_common-dev would contain the header files.
Such more modular packaging would be useful for efficient deployment and horizontal scalability.
Is this a bug?
/home/gael/Projets/Amose/amose-install/AMOSE/SourcesLima/lima_linguisticprocessing/tools/automatonCompiler/libautomatonCompiler/recognizerCompiler.cpp:751:12: warning: enumeration value ‘T_ENTITY_GROUP’ not handled in switch [-Wswitch]
Should T_ENTITY_GROUP be handled or explicitly ignored?
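Either choice can be made explicit so that -Wswitch stays informative. A minimal illustration with a hypothetical enum (not the actual one from recognizerCompiler.cpp): listing every enumerator, even just to ignore one, means a future enumerator will trigger the warning again until it is handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of the transition-type enum from the compiler.
enum TransitionType { T_WORD, T_POS, T_ENTITY_GROUP };

// Every enumerator appears explicitly: T_ENTITY_GROUP is deliberately
// ignored rather than silently falling through, and -Wswitch will flag
// any enumerator added later.
std::string describe(TransitionType t)
{
  switch (t)
  {
    case T_WORD: return "word";
    case T_POS: return "part of speech";
    case T_ENTITY_GROUP: return "ignored";  // explicitly ignored, not forgotten
  }
  return "unknown";
}
```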
When I analyze a document containing the text "chat, chien", the result is the following. A "," term is added, which should not happen.
<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
<node elementName="MEMO">
<node elementName="fre" indexingNode="yes">
<content type="tokens">
<tokens>
<bowToken id="1" lemma="chat" category="8192" categoryString="NC" position="84" length="4"/>
<bowToken id="2" lemma="," category="8192" categoryString="NC" position="88" length="1"/>
<bowToken id="3" lemma="chien" category="8192" categoryString="NC" position="90" length="5"/>
</tokens>
...
</content>
...
</node>
...
</node>
</MultimediaDocuments>
The problem is fixed by correcting the tokenizerAutomaton-fre.tok file:
...
(ALL_LOWER) {
...
- c_del1|c_comma|c_slash|c_hyphen|c_quote|c_percent|c_fraction|m_line = DELIMITER (T_ALPHA,T_SMALL) <---- REPLACE "(T_ALPHA,T_SMALL)" with "(T_WORD_BRK)"
...
}
...
We obtain:
<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
<node elementName="MEMO">
<node elementName="fre" indexingNode="yes">
<content type="tokens">
<tokens>
<bowToken id="1" lemma="chat" category="8192" categoryString="NC" position="84" length="4"/>
<bowToken id="2" lemma="chien" category="8192" categoryString="NC" position="90" length="5"/>
</tokens>
...
</content>
...
</node>
...
</node>
</MultimediaDocuments>
After named-entity recognition, we get the following for "1234 3.2 4,5":
<specific_entities>
<specific_entity>
<string>1234 3.2</string>
<position>1</position>
<length>8</length>
<type>Numex.NUMBER</type>
</specific_entity>
<specific_entity>
<string>1234 3.2 4,5</string>
<position>1</position>
<length>12</length>
<type>Numex.NUMBER</type>
</specific_entity>
</specific_entities>
while we should get three different entities.
The Modex rules can be improved, but not completely, because we cannot have a numeric transition on real numbers, only on integers.
I tried to change the code to allow transitions on real numbers but it does not work. My attempt is on the AutomatonTransitionOnDouble branch. I probably forgot to change something somewhere but I cannot figure out what.
The Modex automaton testAllVertices parameter is no longer used in the code. Remove it from the API and all the documentation (Doxygen, Wiki, …).
There are errors for some lemmas in the dictionary, e.g. "vous" is lemmatized as "cla" or "cln" (with POS tag CLS): I guess these are categories instead of lemmas. This should be corrected in the generation of the dictionary source.
Hi,
I'm trying to build Aymara using the provided Travis.yml script.
At line 14, the apt-get update command seems to have some problems; here is a sample of the output:
Ign http://ubuntu.mirrors.ovh.net trusty InRelease
Ign http://ppa.launchpad.net trusty InRelease
Ign http://ubuntu.mirrors.ovh.net trusty-updates InRelease
Ign http://ubuntu.mirrors.ovh.net trusty-backports InRelease
Ign http://ppa.launchpad.net trusty InRelease
...
Atteint http://security.ubuntu.com trusty-security/restricted Sources
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/universe amd64 Packages
Err http://ppa.launchpad.net trusty/main amd64 Packages
404 Not Found
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/multiverse amd64 Packages
Err http://ppa.launchpad.net trusty/main i386 Packages
404 Not Found
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/main i386 Packages
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/restricted i386 Packages
...
Atteint http://security.ubuntu.com trusty-security/universe i386 Packages
Atteint http://security.ubuntu.com trusty-security/multiverse i386 Packages
Atteint http://security.ubuntu.com trusty-security/main Translation-en
Atteint http://security.ubuntu.com trusty-security/multiverse Translation-en
Atteint http://security.ubuntu.com trusty-security/restricted Translation-en
Atteint http://security.ubuntu.com trusty-security/universe Translation-en
W: Impossible de récupérer http://ppa.launchpad.net/beineri/opt-qt532/ubuntu/dists/trusty/main/binary-amd64/Packages 404 Not Found
W: Impossible de récupérer http://ppa.launchpad.net/beineri/opt-qt532/ubuntu/dists/trusty/main/binary-i386/Packages 404 Not Found
E: Le téléchargement de quelques fichiers d'index a échoué, ils ont été ignorés, ou les anciens ont été utilisés à la place.
In the fourth sentence of the test-conll.txt file, token 26 is referenced in the syntactic dependencies but does not exist in the list of tokens, which has a token 25 and a token 27 but no token 26. The text.txt file is the original input file.
The CoNLL dumper should be enriched to allow the inclusion of coreference information.
In fact, it should be configurable to include or exclude each kind of information.
An option should also allow outputting a header line with information about each column.
Thanks to xtannier for his suggestion.
Currently, the CoNLL dumper uses static mappings to map LIMA tags and relation names to CoNLL ones. This mapping should be optional, and by default only native LIMA tags and relation names should be output. This would avoid outdated and incomplete mappings.
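A sketch of the suggested behavior, with hypothetical names: map a LIMA tag to its CoNLL equivalent only when mapping is explicitly enabled, and fall back to the native tag otherwise, and also when the table has no entry, so incomplete mappings degrade gracefully:

```cpp
#include <cassert>
#include <map>
#include <string>

using TagMap = std::map<std::string, std::string>;

// Hypothetical helper for the CoNLL dumper: return the native LIMA tag
// unless mapping is requested AND the table actually knows this tag.
std::string outputTag(const std::string& limaTag,
                      const TagMap& mapping,
                      bool useMapping)
{
  if (useMapping)
  {
    auto it = mapping.find(limaTag);
    if (it != mapping.end()) return it->second;
  }
  return limaTag;  // default: native LIMA tag, never an outdated guess
}
```

The tag names are illustrative; the point is the default-to-native fallback.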
In the fullXml output, the annotations (in the AnnotationGraph) concerning coreferences are incomplete: they are missing the id of the referred token.
Thanks to xtannier for reporting.
This rule
@Number:(+|-)?:@Number{0-3} %?:NUMBER:=>NormalizeNumber()
is supposed to concatenate a series of at most 4 numbers and tag them as a NUMBER entity.
When LIMA analyzes this example text: "6 98 88 55 45 42 15", it concatenates the whole sequence (a series of 7 numbers) as a single NUMBER entity.
We should normally get two entities:
6 98 88 55
45 42 15
There is a bug in the automaton.
When editing of the rules for component extraction started, the rule file was not duplicated; the work was simply done on the LOCATION-fre.rules file on a branch.
When this work continued later (and in particular at merge and push time), the LOCATION_COMP-fre.rules file was created but the LOCATION-fre.rules file was not restored to its original state (before the changes for component extraction).
We must therefore eliminate any "contamination" of the LOCATION-fre.rules file by component-extraction operations (everything that has a side effect on entity boundaries).
It is probably not desirable to simply check out the LOCATION-fre.rules file at its state before the changes: we would risk losing the corrections made since then.
This is probably not a big job. It must be done for both French and English.
(translated and adapted from OM's explanations)
The QChar::category() method returns an index that is used to find the label of the character's Unicode category via the m_unicodeCategories vector.
Problem: the indices are shifted by 1 because the "NoCategory" category is missing.
The method const CharClass* CharChart::charClass (const LimaChar& c) const therefore gives a wrong result.
Solution: add the missing category.
m_unicodeCategories
<< "NoCategory" <----- ADD missing category
<< "Mark_NonSpacing"
<< "Mark_SpacingCombining"
<< "Mark_Enclosing"
<< "Number_DecimalDigit"
<< "Number_Letter"
<< "Number_Other"
<< "Separator_Space"
<< "Separator_Line"
<< "Separator_Paragraph"
<< "Other_Control"
<< "Other_Format"
<< "Other_Surrogate"
<< "Other_PrivateUse"
<< "Other_NotAssigned"
<< "Letter_Uppercase"
<< "Letter_Lowercase"
<< "Letter_Titlecase"
<< "Letter_Modifier"
<< "Letter_Other"
<< "Punctuation_Connector"
<< "Punctuation_Dash"
<< "Punctuation_Open"
<< "Punctuation_Close"
<< "Punctuation_InitialQuote"
<< "Punctuation_FinalQuote"
<< "Punctuation_Other"
<< "Symbol_Math"
<< "Symbol_Currency"
<< "Symbol_Modifier"
<< "Symbol_Other";
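The off-by-one can be reproduced in miniature with plain std::vector in place of the Qt string list; the category codes below are simplified assumptions for illustration, not Qt's actual enum values:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified category codes, assuming code 0 means "no category".
enum Category { NoCategory = 0, Mark_NonSpacing = 1, Mark_SpacingCombining = 2 };

// Buggy table: "NoCategory" missing, so every label is shifted by one.
const std::vector<std::string> labelsWrong = {
  "Mark_NonSpacing", "Mark_SpacingCombining"
};

// Fixed table: index 0 holds "NoCategory", so codes and labels line up.
const std::vector<std::string> labelsFixed = {
  "NoCategory", "Mark_NonSpacing", "Mark_SpacingCombining"
};
```

Indexing the buggy table with a category code returns the label of the next category, which is exactly the wrong result charClass produced.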
Currently, the use keyword allows listing, in external files, the classes that define gazetteers, and the include keyword allows compiling rules from files external to the current rules file.
It is also possible to define subautomatons in a file and use them in several rules of that file.
But it is not possible to share subautomatons between rules files. This would be useful to avoid duplication and to help maintain rules files. In that case, error reporting should take the inclusion into account.
Currently, unaccented entries are not built during resource building (with the unaccent.pl script, for example). This means that words with wrong accentuation are no longer recognized as they were in old LIMA versions.
Should we implement that again, or just rely on the orthographic correction step?
This old mechanism allowed recognizing strings like "un" or "UN" as instances of "U.N.".
There is no mention in the README or wiki of specific strengths or weaknesses.
When should I use LIMA instead of other FOSS tools? This is the first question new users will ask themselves when discovering the tool, I think.
Libraries are deployed as liblima_xxxx.so.SOVERSION instead of using an actual version number.
While correcting issue #19, we introduced a mapping between the Unicode character classes defined by Qt and our generic class names.
We should also have a means to define this mapping in our tokenizer char table.
We have some problems with number recognition due to some changes in the automaton code.
We tested only this rule:
Number::@Number:NUMBER:=>NormalizeNumber()
which is supposed to recognize a series of two separated numbers as one specific entity of type NUMBER.
Lima analyzed this text:
6 98 88 32 45 44 88 44 88 444 88 110 111 112 223 555 888 777 111 11 12 1 2
and recognized the following entities that contain more than two numbers:
The recognized Numex.NUMBER entities grow one number at a time: "6 98", "6 98 88", "6 98 88 32", and so on up to the full sequence "6 98 88 32 45 44 88 44 88 444 88 110 111 112 223 555 888 777 111 11 12 1 2" (full dump omitted).
The @Number class is defined as follows:
@Number=(
t_comma_number,t_dot_number,t_integer,
deux$NC,
trois$NC,
quatre$NC,
cinq$NC,
six$NC,
sept$NC,
huit$NC,
neuf$NC,
dix$NC,
onze$NC,
douze$NC,
treize$NC,
quatorze$NC,
quinze$NC,
seize$NC,
dix-sept$NC,
dix-huit$NC,
dix-neuf$NC,
vingt$NC,
vingts$NC,
vingt-deux$NC,
vingt-trois$NC,
vingt-quatre$NC,
vingt-cinq$NC,
vingt-six$NC,
vingt-sept$NC,
vingt-huit$NC,
vingt-neuf$NC,
trente$NC,
trente-deux$NC,
trente-trois$NC,
trente-quatre$NC,
trente-cinq$NC,
trente-six$NC,
trente-sept$NC,
trente-huit$NC,
trente-neuf$NC,
quarante$NC,
quarante-deux$NC,
quarante-trois$NC,
quarante-quatre$NC,
quarante-cinq$NC,
quarante-six$NC,
quarante-sept$NC,
quarante-huit$NC,
quarante-neuf$NC,
cinquante$NC,
cinquante-deux$NC,
cinquante-trois$NC,
cinquante-quatre$NC,
cinquante-cinq$NC,
cinquante-six$NC,
cinquante-sept$NC,
cinquante-huit$NC,
cinquante-neuf$NC,
soixante$NC,
soixante-deux$NC,
soixante-trois$NC,
soixante-quatre$NC,
soixante-cinq$NC,
soixante-six$NC,
soixante-sept$NC,
soixante-huit$NC,
soixante-neuf$NC,
septante$NC,
septante-deux$NC,
septante-trois$NC,
septante-quatre$NC,
septante-cinq$NC,
septante-six$NC,
septante-sept$NC,
septante-huit$NC,
septante-neuf$NC,
soixante-dix$NC,
soixante-douze$NC,
soixante-treize$NC,
soixante-quatorze$NC,
soixante-quinze$NC,
soixante-seize$NC,
soixante-dix-sept$NC,
soixante-dix-huit$NC,
soixante-dix-neuf$NC,
huitante$NC,
huitante-deux$NC,
huitante-trois$NC,
huitante-quatre$NC,
huitante-cinq$NC,
huitante-six$NC,
huitante-sept$NC,
huitante-huit$NC,
huitante-neuf$NC,
octante$NC,
octante-deux$NC,
octante-trois$NC,
octante-quatre$NC,
octante-cinq$NC,
octante-six$NC,
octante-sept$NC,
octante-huit$NC,
octante-neuf$NC,
quatre-vingt$NC,
quatre-vingts$NC,
quatre-vingt-un$NC,
quatre-vingt-deux$NC,
quatre-vingt-trois$NC,
quatre-vingt-quatre$NC,
quatre-vingt-cinq$NC,
quatre-vingt-six$NC,
quatre-vingt-sept$NC,
quatre-vingt-huit$NC,
quatre-vingt-neuf$NC,
nonante$NC,
nonante-deux$NC,
nonante-trois$NC,
nonante-quatre$NC,
nonante-cinq$NC,
nonante-six$NC,
nonante-sept$NC,
nonante-huit$NC,
nonante-neuf$NC,
quatre-vingt-dix$NC,
quatre-vingt-onze$NC,
quatre-vingt-douze$NC,
quatre-vingt-treize$NC,
quatre-vingt-quatorze$NC,
quatre-vingt-quinze$NC,
quatre-vingt-seize$NC,
quatre-vingt-dix-sept$NC,
quatre-vingt-dix-huit$NC,
quatre-vingt-dix-neuf$NC,
cent$NC,
cents$NC,
mille$NC
)
@OrdNumber=(
billionième$NC,
centième$NC,
cinquantième$NC,
cinquième$NC,
deuxième$NC,
dixième$NC,
douzième$NC,
huitantième$NC,
huitième$NC,
milliardième$NC,
millionième$NC,
millième$NC,
neuvième$NC,
onzième$NC,
premier$NC,
quarantième$NC,
quatorzième$NC,
quatre-vingtième$NC,
quatrième$NC,
quinzième$NC,
seizième$NC,
septantième$NC,
septième$NC,
sixième$NC,
soixantième$NC,
ter$NC,
treizième$NC,
trentième$NC,
trillionème$NC,
troisième$NC,
unième$NC,
vingtième$NC
)
The rule below is wrong: either there should be parentheses around t_capital_1st, or the {1-3} should be moved out of the group:
@Street::,? (de la|de|du|des|à|aux)? ($NC|$NP|t_capital_1st{1-3}):LOCATION:
But the parser silently accepts it and produces an automaton that matches wrongly, yielding a corrupted analysis graph. When analyzing "Cette maison est la plus belle de la rue.", "rue" is wrongly matched and replaced by a token with no linguistic data (see graph below).
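The grouping error is the same precedence pitfall as in ordinary regular expressions, where a quantifier written inside an alternation group binds only to the last alternative. A minimal, stand-alone C++ illustration (std::regex syntax as an analogy, not the LIMA rule language):

```cpp
#include <regex>
#include <string>

// Whole-string match helper.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// "(a|b|c{1,3})" : {1,3} binds to 'c' only -> exactly ONE alternative matches.
// "(a|b|c){1,3}" : {1,3} binds to the group -> one to three alternatives in a row.
//
// matchesWhole("(a|b|c{1,3})", "aab") -> false (only a single 'a', 'b', or run of 'c')
// matchesWhole("(a|b|c){1,3}", "aab") -> true  (the group repeats three times)
// matchesWhole("(a|b|c{1,3})", "ccc") -> true  ({1,3} is bound to 'c')
```

In the @Street rule the analogous fix is to write (t_capital_1st){1-3}, or to move {1-3} outside the alternation group entirely.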
In the test-conll.txt file, the 25th token of the third sentence has <9 octobre> as its form and < octobre> as its lemma. The day number of the date, which LIMA does identify as a date entity, has disappeared from the lemma.
The first rule below compiles and works as expected, while the second one fails to compile with the message "trying to get a subpart in a unit element". The second rule compiles and works as needed when replacing right.1.3 with right.3. This behavior is wrong and unexpected.
define subautomaton NounGroup {
pattern=$DET? (@Adverb{0-2} @Adj|@Substantif|@ConjCoord|@Participe|@DetNum|@PrepComp){0-n} @Substantif
}
@Copule:@OpenQuot %NounGroup (@Adj){0-n} @ClosQuot:(@Adverb){0-2} @PastParticiple:SYNTACTIC_RELATION:
+!GovernorOf(left.1,"ANY")
+SecondUngovernedBy(left.2.3,right.2,"ANY")
+CreateRelationBetween(left.2.3,right.2,"SUJ_V")
=>AddRelationInGraph()
=<ClearStoredRelations()
@DetNum::%NounGroup:SYNTACTIC_RELATION:
+SecondUngovernedBy(trigger.1,right.1.3,"ANY")
+CreateRelationBetween(trigger.1,right.1.3,"det")
=>AddRelationInGraph()
=<ClearStoredRelations()
The START state should only have ignore (/) actions, since it has no previous character.
Therefore, no state should have a transition back to START.
Bug when folder names contain accented characters: LimaConf files are not found.
Unable to open qslog configuration file:
/home/administrateur/Téléchargements/LivraisonMai/Dist/share/config/amose/log4cpp.properties
Configure Problem
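A plausible cause (an assumption; the failing code path is not shown here) is a lossy conversion of the QString path to a narrow std::string through a non-UTF-8 codec, which mangles the 'é' in Téléchargements. The sketch below shows the lossless direction: encoding the UTF-16 path as UTF-8 bytes so the file system sees the same name (BMP code points only, no surrogate-pair handling):

```cpp
#include <string>

// Hypothetical helper: encode a UTF-16 path (as QString stores it) to UTF-8.
// Routing the path through Latin-1 or the local 8-bit codec instead can
// corrupt accented characters such as 'é' (U+00E9).
std::string toFilesystemPath(const std::u16string& qtStylePath) {
    std::string out;
    for (char16_t c : qtStylePath) {
        if (c < 0x80) {                                   // 1-byte sequence
            out += static_cast<char>(c);
        } else if (c < 0x800) {                           // 2-byte sequence
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                                          // 3-byte sequence
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With Qt itself, the equivalent is to convert paths with QString::toUtf8() (or QFile::encodeName()) rather than a codec that cannot represent the accented characters.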
In "Histoire de la seconde guerre mondiale.", the following rule should match (from EVENT-fre.rules):
guerre$NC:(seconde$ADJ|deuxième$ADJ):mondiale$ADJ:EVENT:seconde guerre mondiale
It does not. It works if tags are removed.
The tokens before SpecificEntities:
<vertex id="5">
<token>
<string>seconde</string>
<position>16</position>
<length>7</length>
<t_status>
<t_alpha>
<t_alpha_capital>t_small</t_alpha_capital>
</t_alpha>
<t_default>t_small</t_default>
</t_status>
</token>
<data>
<simple_word>
<form infl="seconde" lemma="second" norm="second">
<property>
<p prop="GENDER" val="FEM"/>
<p prop="MACRO" val="ADJ"/>
<p prop="MICRO" val="ADJ"/>
<p prop="NUMBER" val="SING"/>
</property>
<property>
<p prop="GENDER" val="FEM"/>
<p prop="MACRO" val="NC"/>
<p prop="MICRO" val="NC"/>
<p prop="NUMBER" val="SING"/>
</property>
</form>
<form infl="seconde" lemma="seconde" norm="seconde">
<property>
<p prop="MACRO" val="ADJ"/>
<p prop="MICRO" val="ADJ"/>
</property>
<property>
<p prop="GENDER" val="FEM"/>
<p prop="MACRO" val="NC"/>
<p prop="MICRO" val="NC"/>
<p prop="NUMBER" val="SING"/>
</property>
</form>
<form infl="seconde" lemma="seconder" norm="seconder">
<property>
<p prop="MACRO" val="V"/>
<p prop="MICRO" val="V"/>
<p prop="NUMBER" val="SING"/>
<p prop="PERSON" val="3"/>
<p prop="SYNTAX" val="INTRANS"/>
<p prop="TIME" val="PRES"/>
</property>
<property>
<p prop="MACRO" val="V"/>
<p prop="MICRO" val="VIMP"/>
<p prop="NUMBER" val="SING"/>
<p prop="PERSON" val="2"/>
<p prop="SYNTAX" val="INTRANS"/>
<p prop="TIME" val="PRES"/>
</property>
</form>
</simple_word>
</data>
</vertex>
<vertex id="6">
<token>
<string>guerre</string>
<position>24</position>
<length>6</length>
<t_status>
<t_alpha>
<t_alpha_capital>t_small</t_alpha_capital>
</t_alpha>
<t_default>t_small</t_default>
</t_status>
</token>
<data>
<simple_word>
<form infl="guerre" lemma="guerre" norm="guerre">
<property>
<p prop="GENDER" val="FEM"/>
<p prop="MACRO" val="NC"/>
<p prop="MICRO" val="NC"/>
<p prop="NUMBER" val="SING"/>
</property>
</form>
</simple_word>
</data>
</vertex>
<vertex id="7">
<token>
<string>mondiale</string>
<position>31</position>
<length>8</length>
<t_status>
<t_alpha>
<t_alpha_capital>t_small</t_alpha_capital>
</t_alpha>
<t_default>t_small</t_default>
</t_status>
</token>
<data>
<simple_word>
<form infl="mondiale" lemma="mondial" norm="mondial">
<property>
<p prop="GENDER" val="FEM"/>
<p prop="MACRO" val="ADJ"/>
<p prop="MICRO" val="ADJ"/>
<p prop="NUMBER" val="SING"/>
</property>
</form>
</simple_word>
</data>
</vertex>
This is needed to avoid forcing programs linking with LIMA to change their main.cpp. Thus, there will be no dependency on Qt at the source level.
Some possible leads:
http://stackoverflow.com/questions/2150488/using-a-qt-based-dll-in-a-non-qt-application
http://stackoverflow.com/questions/14067431/signal-slot-connections-without-qapplication-or-qcoreapplication
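Following the linked Stack Overflow discussions, one common approach is for the library to create the QCoreApplication lazily when the host program has not created one. The sketch below demonstrates the lazy-initialization pattern with a stand-in class so that it does not depend on Qt; the real code would test QCoreApplication::instance() the same way:

```cpp
// Stand-in for QCoreApplication: the real fix would apply the same pattern
// with QCoreApplication::instance(). Names here are illustrative only.
struct FakeApp {
    static FakeApp* s_instance;
    FakeApp() { s_instance = this; }
    static FakeApp* instance() { return s_instance; }
};
FakeApp* FakeApp::s_instance = nullptr;

// Create the application object on first use, but only if the host program
// did not create one itself. A function-local static gives a single instance
// with thread-safe initialization (C++11) and process-lifetime duration, so
// the host's main.cpp never needs to change.
FakeApp* ensureApp() {
    if (FakeApp::instance() == nullptr) {
        static FakeApp app;
    }
    return FakeApp::instance();
}
```

Note that Qt requires QCoreApplication to be created in the main thread, so a real implementation would also have to guard against first use from a worker thread.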
Analyzing the document below throws an exception at line 65 of lima_linguisticprocessing\src\linguisticProcessing\core\MorphologicAnalysis\AccentedConcatenatedDataHandler.cpp
namespace Lima { namespace LinguisticProcessing { namespace MorphologicAnalysis {
AccentedConcatenatedDataHandler::AccentedConcatenatedDataHandler(LinguisticGraph* outputGraph,
const LimaString& sourceStr,
uint64_t positionOffset,
const TStatus& status,
LinguisticAnalysisStructure::MorphoSyntacticType type,
const FsaStringsPool* sp,
FlatTokenizer::CharChart* charChart) :
m_graph(outputGraph),
m_srcStr(sourceStr),
m_positionOffset(positionOffset),
m_status(status),
m_stringsPool(sp),
m_charChart(charChart),
m_concatVertices(),
m_currentToken(0),
m_currentData(0),
m_currentElement()
{
m_currentElement.type=type;
std::vector<unsigned char> mapping;
LimaString desacc=m_charChart->unmarkWithMapping(m_srcStr,mapping);
m_unmarkToTextMapping.resize(mapping.size()+1);
unsigned char i=0;
for (std::vector<unsigned char>::const_iterator it=mapping.begin();
it!=mapping.end();
it++,i++)
{
m_unmarkToTextMapping[*it]=i; // <<<<<<< EXCEPTION HERE
}
*it is 6 here, while the m_unmarkToTextMapping array has a size of 6 (valid indices 0 to 5).
Below is the partial stack trace:
msvcp100d.dll!std::_Debug_message(const wchar_t * message=0x000007fee78b1298, const wchar_t * file=0x000007fee78aed50, unsigned int line=932) Ligne 15 C++
lima-lp-morphologicanalysis.dll!std::vector<unsigned char,std::allocator<unsigned char> >::operator[](unsigned __int64 _Pos=6) Ligne 933 C++
lima-lp-morphologicanalysis.dll!Lima::LinguisticProcessing::MorphologicAnalysis::AccentedConcatenatedDataHandler::AccentedConcatenatedDataHandler(boost::adjacency_list<boost::vecS,boost::vecS,boost::bidirectionalS,boost::property<enum vertex_chain_id_t,std::set<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct,std::less<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct>,std::allocator<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct> >,boost::property<enum boost::vertex_color_t,enum boost::default_color_type,boost::property<enum vertex_data_t,Lima::LinguisticProcessing::LinguisticAnalysisStructure::MorphoSyntacticData *,boost::property<enum vertex_token_t,Lima::LinguisticProcessing::LinguisticAnalysisStructure::Token *,boost::no_property> > > >,boost::no_property,boost::no_property,boost::listS> * outputGraph=0x0000000023db1b90, const QString & sourceStr={...}, unsigned __int64 positionOffset=39122, const Lima::LinguisticProcessing::LinguisticAnalysisStructure::TStatus & status={...}, Lima::LinguisticProcessing::LinguisticAnalysisStructure::MorphoSyntacticType type=SIMPLE_WORD, const Lima::FsaStringsPool * sp=0x00000000032708f0, Lima::LinguisticProcessing::FlatTokenizer::CharChart * charChart=0x00000000207a06b0) Ligne 65 + 0x23 octets C++
lima-lp-morphologicanalysis.dll!Lima::LinguisticProcessing::MorphologicAnalysis::SimpleWord::process(Lima::AnalysisContent & analysis={...}) Ligne 189 + 0xb5 octets C++
lima-common-mediaprocessors.dll!Lima::ProcessUnitPipeline<Lima::MediaProcessUnit>::process(Lima::AnalysisContent & analysis={...}) Ligne 104 + 0x36 octets C++
lima-lp-linguisticprocessing-core.dll!Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(const QString & texte={...}, const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > > & metaData=[10](("ElementName", "ara"),("FileName", "3967"),("Filename", "3967"),("Lang", "ara"),("StartOffset", "375"),("StartOffsetIndexingNode", "375"),("Type", ""),("docid", "3967"),("filePath", ""),("pipeline", "indexer")), const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & pipelineId="indexer", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,Lima::AbstractAnalysisHandler *,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,Lima::AbstractAnalysisHandler *> > > & handlers=[1](("xmlDocumentHandler", 0x000000000ef75e40 {m_out=0x0000000024b1c450 })), const std::set<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > & inactiveUnits=[0](), const Lima::StopAnalyze & stopAnalyze={...}) Ligne 230 + 0x1e octets C++
lima-lp-linguisticprocessing-core.dll!Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & texte="...", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > > & metaData=[10](("ElementName", "ara"),("FileName", "3967"),("Filename", "3967"),("Lang", "ara"),("StartOffset", "375"),("StartOffsetIndexingNode", "375"),("Type", ""),("docid", "3967"),("filePath", ""),("pipeline", "indexer")), const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & pipelineId="indexer", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,Lima::AbstractAnalysisHandler *,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,Lima::AbstractAnalysisHandler *> > > & handlers=[1](("xmlDocumentHandler", 0x000000000ef75e40 {m_out=0x0000000024b1c450 })), const std::set<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > & inactiveUnits=[0](), const Lima::StopAnalyze & stopAnalyze={...}) Ligne 85 + 0x4c octets C++
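A defensive fix (a sketch of the idea only, not the patch applied upstream) is to size the inverse table from the largest value that will be used as an index, instead of assuming mapping.size()+1 is always enough:

```cpp
#include <algorithm>
#include <vector>

// Build the inverse of `mapping` safely: the table must hold at least
// max(mapping)+1 entries, because the VALUES of `mapping` are used as
// indices into it, and they can exceed mapping.size().
std::vector<unsigned char>
buildInverseMapping(const std::vector<unsigned char>& mapping) {
    std::vector<unsigned char> inverse;
    if (mapping.empty()) return inverse;
    const unsigned char maxVal =
        *std::max_element(mapping.begin(), mapping.end());
    inverse.resize(static_cast<std::size_t>(maxVal) + 1, 0);
    unsigned char i = 0;
    for (unsigned char v : mapping) {
        inverse[v] = i;   // always in range after the resize above
        ++i;
    }
    return inverse;
}
```

With the data from the crash (*it == 6 against a table of size 6), sizing from the maximum value would allocate 7 slots and the write would stay in range.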
There is a significant memory leak in normalizeTerm, possibly due to d-pointers.
Here is the output of valgrind:
==7602== 6,000 bytes in 50 blocks are definitely lost in loss record 951 of 968
==7602== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7602== by 0x94B1690: Lima::Node::Node() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602== by 0x94B2D9A: std::_Rb_tree_iterator > std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_emplace_hint_unique, std::tuple<> >(std::_Rb_tree_const_iterator >, std::piecewise_construct_t const&, std::tuple&&, std::tuple<>&&) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602== by 0x94B2795: Lima::Structure::addNode(Lima::Node const&) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602== by 0x5B0BFD1: Lima::LinguisticProcessing::BowTextHandler::endAnalysis() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-analysishandlers.so.2.0.1)
==7602== by 0xAE33810: Lima::DumperStream::~DumperStream() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-mediaprocessors.so.SOVERSION)
==7602== by 0x65FFB60: Lima::LinguisticProcessing::AnalysisDumpers::BowDumper::process(Lima::AnalysisContent&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-analysisdumpers.so.2.0.1)
==7602== by 0x50545AE: Lima::ProcessUnitPipeline::process(Lima::AnalysisContent&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-linguisticprocessing-core.so.2.0.1)
==7602== by 0x5051035: Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(QString const&, std::map, std::allocator > > const&, std::string const&, std::map, std::allocator > > const&, std::set, std::allocator > const&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-linguisticprocessing-core.so.2.0.1)
==7602== by 0x40A1A8: dowork(int, char**) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/bin/normalizeTerm)
==7602== by 0x406DA1: main (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/bin/normalizeTerm)
Note: this issue completes issue #50 that was covering several problems including this one.
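If the leak indeed comes from a d-pointer that is never deleted, the usual cure is to hold the private data in a smart pointer so its release cannot be forgotten. A minimal sketch of the pattern (the class names are illustrative, not the actual Lima::Node):

```cpp
#include <memory>

// Private (pimpl) part of the class. The live-instance counter exists
// only so the sketch can be checked; a real d-pointer class would not
// carry it.
struct NodePrivate {
    static int s_liveCount;
    NodePrivate()  { ++s_liveCount; }
    ~NodePrivate() { --s_liveCount; }
};
int NodePrivate::s_liveCount = 0;

class Node {
public:
    Node() : d(std::make_unique<NodePrivate>()) {}
    // No hand-written destructor needed: unique_ptr deletes d automatically,
    // which is exactly the step a raw d-pointer implementation can miss.
private:
    std::unique_ptr<NodePrivate> d;
};
```

A raw `NodePrivate* d` with a missing `delete d;` in the destructor would produce precisely the "definitely lost" blocks valgrind reports above.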
When analysing "123 45.6 . 12 345.6", we should get three number entities with the correct numeric values:
But we get (simplified):
<type>Numex.NUMBER</type>
<string>123 45.6</string>
<numvalue>0</numvalue>
<type>Numex.NUMBER</type>
<string>12 345.6</string>
<numvalue>0</numvalue>
The changes on branch https://github.com/aymara/lima/tree/AutomatonTransitionOnDouble try to handle the two problems of correctly recognizing the entities and correctly normalizing them. But for an unknown reason, the changes do not work as expected.
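Independently of the automaton changes on that branch, the normalization half of the problem can be sketched in isolation (a stand-alone illustration, not the LIMA code): strip the digit-group separators before converting, so that an already-segmented entity string like "12 345.6" normalizes to 12345.6 instead of the 0 seen above.

```cpp
#include <string>

// Parse a number whose digit groups may be separated by spaces,
// e.g. "12 345.6" -> 12345.6. Returns 0.0 on malformed input,
// mirroring the <numvalue>0</numvalue> fallback in the bug report.
double parseGroupedNumber(const std::string& text) {
    std::string compact;
    for (char c : text)
        if (c != ' ') compact += c;   // drop group separators
    try {
        std::size_t pos = 0;
        double value = std::stod(compact, &pos);
        return pos == compact.size() ? value : 0.0;  // reject trailing junk
    } catch (...) {                                  // invalid or out of range
        return 0.0;
    }
}
```

Note that this assumes the recognizer has already segmented the entity correctly; in "123 45.6" the space must instead be treated as an entity boundary, which is the other half of the problem.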