aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.

Home Page: http://aymara.github.io/lima/

License: Other

CMake 2.49% HTML 0.15% Shell 0.92% C++ 58.49% QMake 0.01% C 0.49% Assembly 0.05% Python 0.76% Perl 3.89% Awk 0.01% Makefile 0.18% TeX 0.27% CSS 0.01% R 0.01% Roff 31.20% QML 0.96% JavaScript 0.07% Batchfile 0.04% Dockerfile 0.01%
natural-language-processing multilingual powerful free-software linux windows machine-learning nlp python cpp

lima's People

Contributors

anuraag-khare, benlabbe, bsid, clemance, deurstann, hleborgne, jtourille, jxmas, kleag, mitaines, mrussotto, pquentin, romaricb, simonmarchal, victorbocharov, vjern


lima's Issues

Bug with strange characters

analyzeText exits with a LinguisticProcessingException on some unusual (but valid) UTF-8 characters, with the error message:
ERROR HypenWordAlternatives : no token forward !

Here is an example of such text (from the Wikipedia page on Navajo language):

Les déverbaux peuvent être longs et complexes, par exemple chidí naaʼnaʼí beeʼeld htsoh bikááʼ dah naaznilígíí " char d'assaut " formé de trois éléments principaux :

* chidí naaʼnaʼí " tracteur à chenilles " formé de chidí " voiture " + naaʼnaʼí " chenille " ( Ici le suffixe -diin apparaît sous la forme -dįį-. . Les autres nombres se forment en plaçant dóó baʼąą " et en plus de " entre le chiffre des dizaines et celui des unités, par exemple tádiin dóó baʼąą tʼááłáʼí " trente-et-un " et ashdladiin dóó baʼąą tʼááʼ " cinquante-trois ". On peut également former les numéraux de 41 à 49 de cette manière : " quarante-deux " dízdiin dóó baʼąą naaki ou bien dízdįįnaaki.

BoWBinaryWriterPrivate::writeSimpleToken(): the lemma length is wrong if it contains XML entities

Example:
analyzing "La production de lait en Ukraine a augmenté dans les 19 régions du pays" yields a length of 10 for "a augmenté", whereas it should be 15 once the XML entity for é is taken into account.

Proposed fix (master branch):

void BoWBinaryWriterPrivate::writeSimpleToken(std::ostream& file,
                 const boost::shared_ptr< BoWToken > token) const
{
#ifdef DEBUG_LP
  BOWLOGINIT;
  LDEBUG << "BoWBinaryWriter::writeSimpleToken write lemma: " << &file << token->getLemma();
#endif
  Misc::writeUTF8StringField(file,token->getLemma());
#ifdef DEBUG_LP
  LDEBUG << "BoWBinaryWriter::writeSimpleToken write infl: " << token->getInflectedForm();
#endif
  Misc::writeUTF8StringField(file,token->getInflectedForm());
  Misc::writeCodedInt(file,token->getCategory());

//////////////// CORRECTION /////////////////////

  // fix the length, which did not take XML entities in the lemma into account
  auto beg = token->getPosition();
  auto end = token->getLength() + beg;

  if (m_shiftFrom.empty())
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom is empty";
#endif
  }
  else
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin" << beg;
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end" << end;
#endif
    auto const it1 = m_shiftFrom.lowerBound(beg-1);
    if (it1 == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin: NO shift";
#endif
    }
    else
    { 
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from begin: shift by" << (it1-1).value();
#endif
      beg += (it1-1).value();
    }
    auto const it2 = m_shiftFrom.lowerBound(end-1);
    if (it2 == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end: NO shift";
#endif
    }
    else
    { 
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from end: shift by" << (it2-1).value();
#endif
      end += (it2-1).value();
    }
  }

  Misc::writeCodedInt(file, beg-1);
  Misc::writeCodedInt(file, end-beg);

///////////////////////// END CORRECTION ///////////////////////////////

/*  Replaced code
  if (m_shiftFrom.empty())
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom is empty";
#endif
    Misc::writeCodedInt(file,token->getPosition()-1);
  }
  else 
  {
#ifdef DEBUG_LP
    LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom from" << token->getPosition();
#endif
    QMap<uint64_t,uint64_t>::const_iterator it = m_shiftFrom.lowerBound(token->getPosition()-1);
    if (it == m_shiftFrom.constBegin())
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom NO shift";
#endif
      Misc::writeCodedInt(file,token->getPosition()-1);
    }
    else
    {
#ifdef DEBUG_LP
      LDEBUG << "BoWBinaryWriter::writeSimpleToken shiftFrom shift by" << (it-1).value();
#endif
      Misc::writeCodedInt(file,token->getPosition()+ (it-1).value()-1);
    }
  }
  Misc::writeCodedInt(file,token->getLength());
*/
}

Missing information in PoS graph when analyzing "C'est"

After analyzing "C'est un test.", the output of the analysis drops "C'est":
3 un un DET DET _ _ 4 DETSUB _ _
4 test test NC NC _ _ _ _ _ _
5 . . PONCTU PONCTU_FORTE _ _ _ _ _ _

Before the syntactic analysis, the analysis graph and the PoS graph are correct (see attached files test-1.txt.bp.dot.png, test-1.txt.dot.png).
After the syntactic analysis, the morphosyntactic data of the "C'" and "est" nodes are corrupted (see attached file test-1.txt.afterSA.dot.png).

Enrich the analysis client API with an access to configuration data

The process described in issue #3 works for analyzeText, which is local and acts as both server and client. For a network client-server setup, it is still
necessary to hard-code the handlers' initialization. To avoid that, the
analysis client API should be enriched with access to this configuration data.

Head token in Modex rules sub-automatons

One should be able to define and use a head token in sub-automatons.

Currently, you define and use a subautomaton like this:

define subautomaton NounGroup {
 pattern=$DET? ($ADV{0-2} $ADJ){0-2} ($NC){0-2} $NC
}

@InfinitiveVerb::%NounGroup:SYNTACTIC_RELATION:
+!GovernorOf(right.1.4,"ANY")
+GovernedBy(trigger.1,"PrepInf")
+CreateRelationBetween(right.1.4,trigger.1,"COD_V")
=>AddRelationInGraph()
=<ClearStoredRelations()

Thus you have to know the structure of the subautomaton to use it. Instead, you should be able to define a head token in the subautomaton and refer to it by name. The above example would become:

define subautomaton NounGroup {
 pattern=$DET? ($ADV{0-2} $ADJ){0-2} ($NC){0-2} $NC
 head=4
}

@InfinitiveVerb::%NounGroup:SYNTACTIC_RELATION:
+!GovernorOf(right.head,"ANY")
+GovernedBy(trigger.1,"PrepInf")
+CreateRelationBetween(right.head,trigger.1,"COD_V")
=>AddRelationInGraph()
=<ClearStoredRelations()

It could be useful (albeit costly) to allow case-insensitive matching in Modex rules

Sometimes, particularly with generated rules, case-insensitive matching in Modex rules would be useful. For example, if a resource lists all entities with a capitalized first token, the rule generator lowercases all tokens, including those that actually appear capitalized in the text.
For example, if two entities are "T cell" and "Anatomic pathology procedure", the generated rules will be:
t:::cell:X:
anatomic::pathology procedure:Y:

But the lemma of "T" in texts will remain "T", and thus the first rule will not match.

If we could specify that the matching should be case-insensitive, this problem would be solved.

Modex should ignore constraints on absent optional elements

Currently, a constraint returns true iff:

  • its element(s) is/are found;
  • its function returns true.

This means that a constraint referring to an absent optional element will return false. For example, the following rule will not match the input "a c":

a:b? c::TYPE:
    +Constraint(left.1,"value")

This behavior should be changed: if we explicitly write that an element is optional, then we probably want constraints to allow its absence too, even though the constraint must still be verified when the element is present.

When the change is made, the documentation (md file) will have to be updated.

Some characters of unknown words are deleted in lemma and normalized form

Hyphen characters and digits of unknown words are deleted in normalized forms because the unmark and minus values of these characters are empty.

We can fix this by modifying the tokenizerAutomaton-lang.chars.tok file, adding an unmark definition for each character we want to keep in the normalized form.

For example, in french the following line
0030, DIGIT ZERO, c_5;
can be modified like
0030, DIGIT ZERO, c_5, u0030;

Without this definition, such characters are deleted.

Switch eng pos tagging learning corpus to OANC

The ANC MASC corpus is much larger than the NLTK WSJ subset that we currently use, and it is really free, making it easier to distribute.

We have to switch to it. This mainly means adapting it to LIMA tokenization (idioms and entities handled before learning the PoS tagging model).

SOVERSION number

When deployed, library names contain 'SOVERSION' instead of an actual number.

Enrich the configuration files API with access to file names

Currently, when a configuration exception occurs (missing module, group, parameter…), it is hard to know which file the information is missing from.

We should add the possibility to access this information. Note that if several files are merged into one configuration, several files could contain the same information, so a kind of stack or list of files will have to be handled.

Also, the elements will have to be linked to their parents. What should be done for inclusions?

Build issue

I just cloned the project from GitHub, installed the prerequisites (including NLTK data), set all variables, and executed ./gbuild.sh; the build aborts (all details below).
I then tried to launch "$LIMA_DIST/bin/analyzeText -l eng ~/jva.txt" but got this error message:

 : Common::PropertyCode : ERROR 2015-04-20T14:45:31.716 0x11d3cb0 invalid XMLPropertyCode file  /home/jean-louis/lima-dist/share/apps/lima/resources/LinguisticProcessings/eng/code-eng.xml 
 : Common::LanguageData : ERROR 2015-04-20T14:45:31.716 0x11d3cb0 Error while reading PropertyFile file:   
terminate called after throwing an instance of 'Lima::InvalidConfiguration'
  what():  
Aborted (core dumped)



====================================================
DISTRIB (64b) :
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"
NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"

UNAME:
Linux ubuntu14-lima 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

==[Variables]==========================================
export JVA=/home/jean-louis
export Qt5_DIR=/opt/qt53
export LIMA_ROOT=$JVA/lima/aymara/lima
LIMA_SOURCES=$LIMA_ROOT/lima/aymara/lima
export LIMA_BUILD_DIR=$LIMA_SOURCES/build
export NLTK_PTB_DP_FILE=$JVA/nltk_data/corpora/dependency_treebank/nltk-ptb.dp
export LINGUISTIC_DATA_ROOT=$LIMA_SOURCES/lima_linguisticData
export LIMA_DIST=$JVA/lima-dist
export LIMA_CONF=$LIMA_DIST/share/config/lima
export LIMA_RESOURCES=$LIMA_DIST/share/apps/lima/resources
export LIMA_EXTERNALS=$LIMA_ROOT/externals

export PATH=$LIMA_DIST/bin:$LIMA_DIST/share/apps/lima/scripts:$PATH
export LD_LIBRARY_PATH=$LIMA_EXTERNALS/lib:$LIMA_DIST/lib:/opt/qt53/lib

===[Compilation end]================================================
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.idiom.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.sa.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.se.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.se-PERSON.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.simpleword.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.tokenizer.xml
-- Up-to-date: /home/jean-louis/lima-dist/share/apps/lima/tests/test-fre.multilevel.xml
[ 52%] Performing test step for 'lima_linguisticprocessing'
make[3]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
Running tests...
Test project /home/jean-louis/lima/aymara/lima/lima/aymara/lima/build/master/debug/lima/lima_linguisticprocessing-prefix/src/lima_linguisticprocessing-build
    Start 1: BagOfWordsTest0
1/6 Test #1: BagOfWordsTest0 ..................   Passed    0.01 sec
    Start 2: BagOfWordsTest1
2/6 Test #2: BagOfWordsTest1 ..................   Passed    0.02 sec
    Start 3: BagOfWordsTest2
3/6 Test #3: BagOfWordsTest2 ..................   Passed    0.03 sec
    Start 4: AnnotationGraphTest0
4/6 Test #4: AnnotationGraphTest0 .............   Passed    0.03 sec
    Start 5: CharChartTest0
5/6 Test #5: CharChartTest0 ...................   Passed    0.06 sec
    Start 6: CharChartTestAra
6/6 Test #6: CharChartTestAra .................   Passed    0.02 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) =   0.17 sec
[ 54%] Completed 'lima_linguisticprocessing'
[ 54%] Built target lima_linguisticprocessing
make: *** [all] Error 2

(comment updated to avoid wrong links to other issues)

During the analysis of a document containing a comma, the comma is lost

When I analyze the text "12,8" in French, I get the following result: the "," character disappears from the lemma.

<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
  <node elementName="MEMO">
    <node elementName="fre" indexingNode="yes">
      <content type="tokens">
        <tokens>
          <bowNamedEntity id="1" lemma="128" category="8192" categoryString="NC" position="84" length="4" type="Numex.NUMBER">
            <parts head="0">
              <bowToken id="2" lemma="128" category="8192" categoryString="NC" position="84" length="4"/>
            </parts>
            <feature name="numvalue" value="12.8"/>
            <feature name="value" value="12,8"/>
          </bowNamedEntity>
        </tokens>
        <properties>
          <property name="ContentId" type="int" value="1"/>
          <property name="type" type="string" value="tokens"/>
        </properties>
      </content>
      <properties>
        <property name="ContentId" type="int" value="1"/>
        <property name="NodeId" type="int" value="2"/>
        <property name="StructureId" type="int" value="1"/>
        <property name="offBegPrpty" type="int" value="84"/>
        <property name="offEndPrpty" type="int" value="88"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="langPrpty" type="string" value="fre"/>
        <property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
        <property name="indexDatePrpty" type="date" value="20170420"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="60"/>
      <property name="offEndPrpty" type="int" value="94"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="identPrpty" type="string" value="3947"/>
      <property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
      <property name="indexDatePrpty" type="date" value="20170420"/>
    </properties>
  </node>
</MultimediaDocuments>

The problem comes from the CharChart class. I propose the following fix in these two methods:

LimaString CharChart::unmarkByString (const LimaChar& c) const
{

...

// added
  if (result.isEmpty())
    result.push_back(c);
// end of addition
#ifdef DEBUG_LP
  LDEBUG << "CharChart::unmarkByString" << result;
#endif
  return result;
}

LimaString CharChart::unmark(const LimaString& str) const
{

...

    // silently discard invalid character
    catch (InvalidCharException) {}  <----- LINE TO REMOVE
    catch (InvalidCharException) { desaccented.push_back(str.at(i)); }  <----- LINE TO ADD
  }
  return desaccented;
}

This gives:

<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
  <node elementName="MEMO">
    <node elementName="fre" indexingNode="yes">
      <content type="tokens">
        <tokens>
          <bowNamedEntity id="1" lemma="12,8" category="8192" categoryString="NC" position="84" length="4" type="Numex.NUMBER">
            <parts head="0">
              <bowToken id="2" lemma="12,8" category="8192" categoryString="NC" position="84" length="4"/>
            </parts>
            <feature name="numvalue" value="12.8"/>
            <feature name="value" value="12,8"/>
          </bowNamedEntity>
        </tokens>
        <properties>
          <property name="ContentId" type="int" value="1"/>
          <property name="type" type="string" value="tokens"/>
        </properties>
      </content>
      <properties>
        <property name="ContentId" type="int" value="1"/>
        <property name="NodeId" type="int" value="2"/>
        <property name="StructureId" type="int" value="1"/>
        <property name="offBegPrpty" type="int" value="84"/>
        <property name="offEndPrpty" type="int" value="88"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="langPrpty" type="string" value="fre"/>
        <property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
        <property name="indexDatePrpty" type="date" value="20170420"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="60"/>
      <property name="offEndPrpty" type="int" value="94"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="identPrpty" type="string" value="3947"/>
      <property name="srcePrpty" type="string" value="X:\Program Files\AntInno\AntBox\TestAmoseV5\Server\data\doc\indexed\d_3\3947.xml"/>
      <property name="indexDatePrpty" type="date" value="20170420"/>
    </properties>
  </node>
</MultimediaDocuments>

Improve dumpers and handlers situation

Replace the management of dumpers and handlers in analyzeText: given a
pipeline, it is possible to check the language configuration file to retrieve
the active dumpers and then the handlers they need (name and class id). One can
then instantiate the handlers and give them to the client.

build both -dev and non-dev packages

Building the Aymara code currently produces two packages: lima_common and lima_linguisticprocessing.
From the same code, we need to build the following four packages: lima_common, lima_common-dev, lima_linguisticprocessing, and lima_linguisticprocessing-dev.
lima_common would contain the library and binaries.
lima_common-dev would contain the header files.
Such more modular packaging would be useful for efficient deployment and horizontal scalability.
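With CMake's component-aware install commands this could look roughly like the following. This is a sketch only: the target name and component names are assumptions, not the project's actual CMake code.

```cmake
# Split installed files into runtime and development components.
install(TARGETS lima_common
        LIBRARY DESTINATION lib COMPONENT runtime
        RUNTIME DESTINATION bin COMPONENT runtime)
install(DIRECTORY include/ DESTINATION include COMPONENT dev)

# Ask CPack to emit one package per component,
# e.g. lima_common and lima_common-dev.
set(CPACK_DEB_COMPONENT_INSTALL ON)
include(CPack)
```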

Possible bug with not handled enumeration switch

Is this a bug?
/home/gael/Projets/Amose/amose-install/AMOSE/SourcesLima/lima_linguisticprocessing/tools/automatonCompiler/libautomatonCompiler/recognizerCompiler.cpp:751:12: warning: enumeration value ‘T_ENTITY_GROUP’ not handled in switch [-Wswitch]

Should T_ENTITY_GROUP be handled or explicitly ignored?

The analysis generates tokens with the NC tag for commas; this should not happen

When I analyze a document containing the text "chat, chien", the result is as follows: a "," term is added when it should not be.

<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
  <node elementName="MEMO">
    <node elementName="fre" indexingNode="yes">
      <content type="tokens">
        <tokens>
          <bowToken id="1" lemma="chat" category="8192" categoryString="NC" position="84" length="4"/>
          <bowToken id="2" lemma="," category="8192" categoryString="NC" position="88" length="1"/>
          <bowToken id="3" lemma="chien" category="8192" categoryString="NC" position="90" length="5"/>
        </tokens>
...
      </content>
...
    </node>
...
  </node>
</MultimediaDocuments>

The problem is fixed by correcting the tokenizerAutomaton-fre.tok file:

...
(ALL_LOWER) {
...
  - c_del1|c_comma|c_slash|c_hyphen|c_quote|c_percent|c_fraction|m_line = DELIMITER (T_ALPHA,T_SMALL)  <---- REPLACE "(T_ALPHA,T_SMALL)" with "(T_WORD_BRK)"
 ...
}
...

We then get:

<?xml-stylesheet type="text/xsl" href="bow.xslt"?>
<MultimediaDocuments>
  <node elementName="MEMO">
    <node elementName="fre" indexingNode="yes">
      <content type="tokens">
        <tokens>
          <bowToken id="1" lemma="chat" category="8192" categoryString="NC" position="84" length="4"/>
          <bowToken id="2" lemma="chien" category="8192" categoryString="NC" position="90" length="5"/>
        </tokens>
...
      </content>
...
    </node>
...
  </node>
</MultimediaDocuments>

Error in the analysis of consecutive numeric entities

After named entities, we get for "1234 3.2 4,5":

<specific_entities>
<specific_entity>
  <string>1234 3.2</string>
  <position>1</position>
  <length>8</length>
  <type>Numex.NUMBER</type>
</specific_entity>
<specific_entity>
  <string>1234 3.2 4,5</string>
  <position>1</position>
  <length>12</length>
  <type>Numex.NUMBER</type>
</specific_entity>
</specific_entities>

while we should get three different entities.

Modex rules can be improved, but not completely, because we cannot have a numeric transition on real numbers, only on integers.

I tried to change the code to allow transitions on real numbers, but it does not work. My attempt is on the AutomatonTransitionOnDouble branch. I probably forgot to change something somewhere, but I cannot figure out what.

Weird lemmas

There are errors with some lemmas in the dictionary, e.g. "vous" is lemmatized as "cla" or "cln" (with POS tag CLS): I guess these are categories instead of lemmas. This should be corrected in the generation of the dictionary source.

Travis.yml not working properly

Hi,

I'm trying to build Aymara using the provided Travis.yml script.

At line 14, the apt-get update command seems to have some problems; here is a sample of the output:

Ign http://ubuntu.mirrors.ovh.net trusty InRelease
Ign http://ppa.launchpad.net trusty InRelease                                  
Ign http://ubuntu.mirrors.ovh.net trusty-updates InRelease                     
Ign http://ubuntu.mirrors.ovh.net trusty-backports InRelease                   
Ign http://ppa.launchpad.net trusty InRelease
...
Atteint http://security.ubuntu.com trusty-security/restricted Sources          
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/universe amd64 Packages   
Err http://ppa.launchpad.net trusty/main amd64 Packages                        
  404  Not Found
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/multiverse amd64 Packages 
Err http://ppa.launchpad.net trusty/main i386 Packages                         
  404  Not Found
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/main i386 Packages        
Atteint http://ubuntu.mirrors.ovh.net trusty-updates/restricted i386 Packages
...
Atteint http://security.ubuntu.com trusty-security/universe i386 Packages
Atteint http://security.ubuntu.com trusty-security/multiverse i386 Packages
Atteint http://security.ubuntu.com trusty-security/main Translation-en
Atteint http://security.ubuntu.com trusty-security/multiverse Translation-en
Atteint http://security.ubuntu.com trusty-security/restricted Translation-en
Atteint http://security.ubuntu.com trusty-security/universe Translation-en
W: Impossible de récupérer http://ppa.launchpad.net/beineri/opt-qt532/ubuntu/dists/trusty/main/binary-amd64/Packages  404  Not Found

W: Impossible de récupérer http://ppa.launchpad.net/beineri/opt-qt532/ubuntu/dists/trusty/main/binary-i386/Packages  404  Not Found

E: Le téléchargement de quelques fichiers d'index a échoué, ils ont été ignorés, ou les anciens ont été utilisés à la place.

Enrich the CONLL dumper

The CONLL dumper should be enriched to allow the inclusion of coreference information.

In fact, it should be configurable to include or exclude each kind of information.

An option should also allow outputting a header line describing each column.

Thanks to xtannier for his suggestion.

CONLL dumper should not map categories by default

Currently, the CONLL dumper uses static mappings to convert LIMA tags and relation names to CONLL ones. This should be optional; by default, only native LIMA tags and relation names should be output. This would avoid outdated and incomplete mappings.

Coreference annotations are incomplete

In the fullXml output, the annotations (in the AnnotationGraph) concerning coreferences are incomplete: they are missing the id of the referred-to token.

Thanks to xtannier for reporting.

Problem with automaton (out of range recognition)

This rule:
@Number:(+|-)?:@Number{0-3} %?:NUMBER:=>NormalizeNumber()
is supposed to concatenate a series of at most 4 numbers and tag them as a NUMBER entity.

When LIMA analyzes this example text, 6 98 88 55 45 42 15, it concatenates the whole sequence (a series of 7 numbers) as a single NUMBER entity.

We should normally have two entities:
6 98 88 55
45 42 15

There is a bug in the automaton.

Error in handling location entities

When editing of the rules for component extraction started, the rule file was not duplicated; the work was simply done on the LOCATION-fre.rules file on a branch.
When this work continued later (in particular at merge and push time), the LOCATION_COMP-fre.rules file was created, but LOCATION-fre.rules was not restored to its original state (before the changes for component extraction).

We must therefore eliminate any "contamination" of the LOCATION-fre.rules file by component extraction operations (everything that has a side effect on entity boundaries).
It is probably not appropriate to simply check out LOCATION-fre.rules at its state before the changes: we would risk losing the corrections made since then.

This is probably not a big job. It must be done for both French and English.

(translated and adapted from OM's explanations)

CharChart class: the "NoCategory" Unicode category is missing

The QChar::category() method returns an index used to look up the label of the character's Unicode category in the m_unicodeCategories vector.
Problem: the indices are off by one because the "NoCategory" category is missing.
The method const CharClass* CharChart::charClass (const LimaChar& c) const therefore returns a wrong result.
Solution: add the missing category.

m_unicodeCategories
  << "NoCategory"           <----- ADDED missing category
  << "Mark_NonSpacing"
  << "Mark_SpacingCombining"
  << "Mark_Enclosing"
  << "Number_DecimalDigit"
  << "Number_Letter"
  << "Number_Other"
  << "Separator_Space"
  << "Separator_Line"
  << "Separator_Paragraph"
  << "Other_Control"
  << "Other_Format"
  << "Other_Surrogate"
  << "Other_PrivateUse"
  << "Other_NotAssigned"
  << "Letter_Uppercase"
  << "Letter_Lowercase"
  << "Letter_Titlecase"
  << "Letter_Modifier"
  << "Letter_Other"
  << "Punctuation_Connector"
  << "Punctuation_Dash"
  << "Punctuation_Open"
  << "Punctuation_Close"
  << "Punctuation_InitialQuote"
  << "Punctuation_FinalQuote"
  << "Punctuation_Other"
  << "Symbol_Math"
  << "Symbol_Currency"
  << "Symbol_Modifier"
  << "Symbol_Other";

Should be able to share subautomatons between several Modex rules files

Currently, the use keyword allows listing classes from external files to define gazetteers, and the include keyword allows compiling rules from files external to the current rules file.

Also, it is possible to define subautomatons in a file that are used in several rules of this file.

But it is not possible to share subautomatons between rules files. This would be useful to avoid duplication and help the maintenance of rules files. In this case, error reporting should take this inclusion into account.

Unaccented dictionary entries are not built

Currently, unaccented entries are not built during resource building (for example with the unaccent.pl script). This means that words with wrong accentuation are no longer recognized, as they were in old LIMA versions.

Should we implement that again, or just rely on the orthographic correction step?

This old method allowed recognizing strings like "un" or "UN" as instances of "U.N.".

What sets LIMA apart from other tools?

There is no mention in the README or wiki of specific strengths or weaknesses.

When should I use LIMA instead of other FOSS tools? This is the first question new users will ask themselves when discovering the tool, I think.

Problem with automaton and named entity recognition

We have some problems with number recognition due to some changes in the automaton code.

While testing only this rule:
Number::@Number:NUMBER:=>NormalizeNumber()
which is supposed to recognize a series of two separated numbers as one specific entity of type NUMBER, LIMA analyzed this text:
6 98 88 32 45 44 88 44 88 444 88 110 111 112 223 555 888 777 111 11 12 1 2

and recognized the following entities that contain more than two numbers:

[Truncated analyzer debug output: a long series of successive Numex.NUMBER matches over the digit sequence 6 98 88 32 45 44 88 44 88 444 88 110 111 112 223 555 888 777 111 11 12 1 2]

@Number=(
t_comma_number,t_dot_number,t_integer,
deux$NC,
trois$NC,
quatre$NC,
cinq$NC,
six$NC,
sept$NC,
huit$NC,
neuf$NC,
dix$NC,
onze$NC,
douze$NC,
treize$NC,
quatorze$NC,
quinze$NC,
seize$NC,
dix-sept$NC,
dix-huit$NC,
dix-neuf$NC,
vingt$NC,
vingts$NC,
vingt-deux$NC,
vingt-trois$NC,
vingt-quatre$NC,
vingt-cinq$NC,
vingt-six$NC,
vingt-sept$NC,
vingt-huit$NC,
vingt-neuf$NC,
trente$NC,
trente-deux$NC,
trente-trois$NC,
trente-quatre$NC,
trente-cinq$NC,
trente-six$NC,
trente-sept$NC,
trente-huit$NC,
trente-neuf$NC,
quarante$NC,
quarante-deux$NC,
quarante-trois$NC,
quarante-quatre$NC,
quarante-cinq$NC,
quarante-six$NC,
quarante-sept$NC,
quarante-huit$NC,
quarante-neuf$NC,
cinquante$NC,
cinquante-deux$NC,
cinquante-trois$NC,
cinquante-quatre$NC,
cinquante-cinq$NC,
cinquante-six$NC,
cinquante-sept$NC,
cinquante-huit$NC,
cinquante-neuf$NC,
soixante$NC,
soixante-deux$NC,
soixante-trois$NC,
soixante-quatre$NC,

soixante-cinq$NC,
soixante-six$NC,
soixante-sept$NC,
soixante-huit$NC,
soixante-neuf$NC,
septante$NC,
septante-deux$NC,
septante-trois$NC,
septante-quatre$NC,
septante-cinq$NC,
septante-six$NC,
septante-sept$NC,
septante-huit$NC,
septante-neuf$NC,
soixante-dix$NC,
soixante-douze$NC,
soixante-treize$NC,
soixante-quatorze$NC,
soixante-quinze$NC,
soixante-seize$NC,
soixante-dix-sept$NC,
soixante-dix-huit$NC,
soixante-dix-neuf$NC,
huitante$NC,
huitante-deux$NC,
huitante-trois$NC,
huitante-quatre$NC,
huitante-cinq$NC,
huitante-six$NC,
huitante-sept$NC,
huitante-huit$NC,
huitante-neuf$NC,
octante$NC,
octante-deux$NC,
octante-trois$NC,
octante-quatre$NC,
octante-cinq$NC,
octante-six$NC,
octante-sept$NC,
octante-huit$NC,
octante-neuf$NC,
quatre-vingt$NC,
quatre-vingts$NC,
quatre-vingt-un$NC,
quatre-vingt-deux$NC,
quatre-vingt-trois$NC,
quatre-vingt-quatre$NC,
quatre-vingt-cinq$NC,
quatre-vingt-six$NC,
quatre-vingt-sept$NC,
quatre-vingt-huit$NC,
quatre-vingt-neuf$NC,
nonante$NC,
nonante-deux$NC,
nonante-trois$NC,
nonante-quatre$NC,
nonante-cinq$NC,
nonante-six$NC,
nonante-sept$NC,
nonante-huit$NC,
nonante-neuf$NC,
quatre-vingt-dix$NC,
quatre-vingt-onze$NC,
quatre-vingt-douze$NC,
quatre-vingt-treize$NC,
quatre-vingt-quatorze$NC,
quatre-vingt-quinze$NC,
quatre-vingt-seize$NC,
quatre-vingt-dix-sept$NC,
quatre-vingt-dix-huit$NC,
quatre-vingt-dix-neuf$NC,
cent$NC,
cents$NC,
mille$NC
)

@OrdNumber=(
billionième$NC,
centième$NC,
cinquantième$NC,
cinquième$NC,
deuxième$NC,
dixième$NC,
douzième$NC,
huitantième$NC,
huitième$NC,
milliardième$NC,
millionième$NC,
millième$NC,
neuvième$NC,
onzième$NC,
premier$NC,
quarantième$NC,
quatorzième$NC,
quatre-vingtième$NC,
quatrième$NC,
quinzième$NC,
seizième$NC,
septantième$NC,
septième$NC,
sixième$NC,
soixantième$NC,
ter$NC,
treizième$NC,
trentième$NC,
trillionème$NC,
troisième$NC,
unième$NC,
vingtième$NC
)

Rule parser accepts wrong syntax and generates a buggy automaton

The rule below is wrong. Either there should be parentheses around t_capital_1st or the {1-3} should be moved out of the group:
@Street::,? (de la|de|du|des|à|aux)? ($NC|$NP|t_capital_1st{1-3}):LOCATION:

But the parser silently accepts it and produces an automaton that matches incorrectly and corrupts the analysis graph. When analyzing "Cette maison est la plus belle de la rue.", "rue" is wrongly matched and replaced by a token with no linguistic data (see graph below).

[graph image attached]
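For reference, here are the two corrections suggested above, sketched in the same rule syntax: first with parentheses added around t_capital_1st so that {1-3} applies to it alone, then with the repetition moved out of the alternative group. Neither variant has been verified against the rule compiler.

```
@Street::,? (de la|de|du|des|à|aux)? ($NC|$NP|(t_capital_1st){1-3}):LOCATION:
@Street::,? (de la|de|du|des|à|aux)? ($NC|$NP|t_capital_1st){1-3}:LOCATION:
```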

Error in referring to subautomaton parts

The first rule below compiles and works as expected, while the second one fails to compile with the message trying to get a subpart in a unit element. The second rule compiles and works as intended when right.1.3 is replaced by right.3. This indexing behavior is inconsistent between the two rules and unexpected.

define subautomaton NounGroup {
 pattern=$DET? (@Adverb{0-2} @Adj|@Substantif|@ConjCoord|@Participe|@DetNum|@PrepComp){0-n} @Substantif
}

@Copule:@OpenQuot %NounGroup (@Adj){0-n} @ClosQuot:(@Adverb){0-2} @PastParticiple:SYNTACTIC_RELATION:
+!GovernorOf(left.1,"ANY")
+SecondUngovernedBy(left.2.3,right.2,"ANY")
+CreateRelationBetween(left.2.3,right.2,"SUJ_V")
=>AddRelationInGraph()
=<ClearStoredRelations()

@DetNum::%NounGroup:SYNTACTIC_RELATION:
+SecondUngovernedBy(trigger.1,right.1.3,"ANY")
+CreateRelationBetween(trigger.1,right.1.3,"det")
=>AddRelationInGraph()
=<ClearStoredRelations()
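For reference, the workaround described above (right.3 instead of right.1.3) would look as follows; it reportedly compiles and works, even though this indexing is inconsistent with the first rule:

```
@DetNum::%NounGroup:SYNTACTIC_RELATION:
+SecondUngovernedBy(trigger.1,right.3,"ANY")
+CreateRelationBetween(trigger.1,right.3,"det")
=>AddRelationInGraph()
=<ClearStoredRelations()
```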

Tokenizer automatons are incorrect

The START state should only have ignore (/) actions, since it has no previous character.
Therefore, no state should have a transition back to START.
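This invariant can be checked mechanically. Below is a minimal sketch, assuming a hypothetical representation of an automaton as a dict mapping state names to lists of (event, target) transitions; the actual LIMA tokenizer automaton format differs.

```python
def states_returning_to_start(transitions):
    """Return the states (other than START itself) that have a transition
    back to START; for a well-formed automaton this list should be empty."""
    return sorted(
        state
        for state, edges in transitions.items()
        for _event, target in edges
        if target == "START" and state != "START"
    )

# Toy automaton: S1 wrongly transitions back to START.
automaton = {
    "START": [("letter", "WORD"), ("/", "START")],
    "WORD": [("letter", "WORD"), ("space", "S1")],
    "S1": [("letter", "START")],  # violation of the invariant
}
print(states_returning_to_start(automaton))  # ['S1']
```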

Problem with folders and LimaConf environment

Bug when folder names have accents. LimaConf files are not found.
Unable to open qslog configuration file:
/home/administrateur/Téléchargements/LivraisonMai/Dist/share/config/amose/log4cpp.properties
Configure Problem
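One plausible cause (not verified against the LIMA configuration loader) is a UTF-8/Latin-1 mix-up on the path. The sketch below shows how such a mix-up makes an accented path unfindable; the path is copied from the error message above.

```python
path = "/home/administrateur/Téléchargements/LivraisonMai/Dist/share/config/amose/log4cpp.properties"

# If the UTF-8 bytes of the path are at some point decoded as Latin-1,
# the resulting name no longer matches the file on disk:
mangled = path.encode("utf-8").decode("latin-1")
print(mangled == path)  # False: each "é" became "Ã©"
```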

Specific entities recognition error

In "Histoire de la seconde guerre mondiale.", the following rule should match (from EVENT-fre.rules):
guerre$NC:(seconde$ADJ|deuxième$ADJ):mondiale$ADJ:EVENT:seconde guerre mondiale

It does not. It works if tags are removed.

The tokens before SpecificEntities (note that "seconde" does carry an ADJ reading, so the left-context constraint should be satisfiable):

<vertex id="5">
  <token>
    <string>seconde</string>
    <position>16</position>
    <length>7</length>
    <t_status>
        <t_alpha>
          <t_alpha_capital>t_small</t_alpha_capital>
        </t_alpha>
        <t_default>t_small</t_default>
    </t_status>
  </token>
    <data>
      <simple_word>
      <form infl="seconde" lemma="second" norm="second">
        <property>
          <p prop="GENDER" val="FEM"/>
          <p prop="MACRO" val="ADJ"/>
          <p prop="MICRO" val="ADJ"/>
          <p prop="NUMBER" val="SING"/>
        </property>
        <property>
          <p prop="GENDER" val="FEM"/>
          <p prop="MACRO" val="NC"/>
          <p prop="MICRO" val="NC"/>
          <p prop="NUMBER" val="SING"/>
        </property>
        </form>
      <form infl="seconde" lemma="seconde" norm="seconde">
        <property>
          <p prop="MACRO" val="ADJ"/>
          <p prop="MICRO" val="ADJ"/>
        </property>
        <property>
          <p prop="GENDER" val="FEM"/>
          <p prop="MACRO" val="NC"/>
          <p prop="MICRO" val="NC"/>
          <p prop="NUMBER" val="SING"/>
        </property>
        </form>
      <form infl="seconde" lemma="seconder" norm="seconder">
        <property>
          <p prop="MACRO" val="V"/>
          <p prop="MICRO" val="V"/>
          <p prop="NUMBER" val="SING"/>
          <p prop="PERSON" val="3"/>
          <p prop="SYNTAX" val="INTRANS"/>
          <p prop="TIME" val="PRES"/>
        </property>
        <property>
          <p prop="MACRO" val="V"/>
          <p prop="MICRO" val="VIMP"/>
          <p prop="NUMBER" val="SING"/>
          <p prop="PERSON" val="2"/>
          <p prop="SYNTAX" val="INTRANS"/>
          <p prop="TIME" val="PRES"/>
        </property>
      </form>
    </simple_word>
    </data>
</vertex>
<vertex id="6">
  <token>
    <string>guerre</string>
    <position>24</position>
    <length>6</length>
    <t_status>
        <t_alpha>
          <t_alpha_capital>t_small</t_alpha_capital>
        </t_alpha>
        <t_default>t_small</t_default>
    </t_status>
  </token>
    <data>
      <simple_word>
      <form infl="guerre" lemma="guerre" norm="guerre">
        <property>
          <p prop="GENDER" val="FEM"/>
          <p prop="MACRO" val="NC"/>
          <p prop="MICRO" val="NC"/>
          <p prop="NUMBER" val="SING"/>
        </property>
      </form>
    </simple_word>
    </data>
</vertex>
<vertex id="7">
  <token>
    <string>mondiale</string>
    <position>31</position>
    <length>8</length>
    <t_status>
        <t_alpha>
          <t_alpha_capital>t_small</t_alpha_capital>
        </t_alpha>
        <t_default>t_small</t_default>
    </t_status>
  </token>
    <data>
      <simple_word>
      <form infl="mondiale" lemma="mondial" norm="mondial">
        <property>
          <p prop="GENDER" val="FEM"/>
          <p prop="MACRO" val="ADJ"/>
          <p prop="MICRO" val="ADJ"/>
          <p prop="NUMBER" val="SING"/>
        </property>
      </form>
    </simple_word>
    </data>
</vertex>

Analyzing an Arabic document crashes the analyzer

Analyzing the attached document throws an exception at line 65 of lima_linguisticprocessing\src\linguisticProcessing\core\MorphologicAnalysis\AccentedConcatenatedDataHandler.cpp

namespace Lima { namespace LinguisticProcessing { namespace MorphologicAnalysis {

AccentedConcatenatedDataHandler::AccentedConcatenatedDataHandler(LinguisticGraph* outputGraph,
    const LimaString& sourceStr,
    uint64_t positionOffset,
    const TStatus& status,
    LinguisticAnalysisStructure::MorphoSyntacticType type,
    const FsaStringsPool* sp,
    FlatTokenizer::CharChart* charChart) :
    m_graph(outputGraph),
    m_srcStr(sourceStr),
    m_positionOffset(positionOffset),
    m_status(status),
    m_stringsPool(sp),
    m_charChart(charChart),
    m_concatVertices(),
    m_currentToken(0),
    m_currentData(0),
    m_currentElement()
{
  m_currentElement.type=type;
  
  std::vector<unsigned char> mapping;
  LimaString desacc=m_charChart->unmarkWithMapping(m_srcStr,mapping);
  m_unmarkToTextMapping.resize(mapping.size()+1);
  unsigned char i=0;
  for (std::vector<unsigned char>::const_iterator it=mapping.begin();
       it!=mapping.end();
       it++,i++)
  {
    m_unmarkToTextMapping[*it]=i;   // <<<<<<< EXCEPTION HERE
  }

*it is 6 while the array m_unmarkToTextMapping has size 6 (valid indices 0 to 5)

Below is the partial stack trace:

msvcp100d.dll!std::_Debug_message(const wchar_t * message=0x000007fee78b1298, const wchar_t * file=0x000007fee78aed50, unsigned int line=932)  Ligne 15	C++
 	lima-lp-morphologicanalysis.dll!std::vector<unsigned char,std::allocator<unsigned char> >::operator[](unsigned __int64 _Pos=6)  Ligne 933	C++
	lima-lp-morphologicanalysis.dll!Lima::LinguisticProcessing::MorphologicAnalysis::AccentedConcatenatedDataHandler::AccentedConcatenatedDataHandler(boost::adjacency_list<boost::vecS,boost::vecS,boost::bidirectionalS,boost::property<enum vertex_chain_id_t,std::set<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct,std::less<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct>,std::allocator<Lima::LinguisticProcessing::LinguisticAnalysisStructure::ChainIdStruct> >,boost::property<enum boost::vertex_color_t,enum boost::default_color_type,boost::property<enum vertex_data_t,Lima::LinguisticProcessing::LinguisticAnalysisStructure::MorphoSyntacticData *,boost::property<enum vertex_token_t,Lima::LinguisticProcessing::LinguisticAnalysisStructure::Token *,boost::no_property> > > >,boost::no_property,boost::no_property,boost::listS> * outputGraph=0x0000000023db1b90, const QString & sourceStr={...}, unsigned __int64 positionOffset=39122, const Lima::LinguisticProcessing::LinguisticAnalysisStructure::TStatus & status={...}, Lima::LinguisticProcessing::LinguisticAnalysisStructure::MorphoSyntacticType type=SIMPLE_WORD, const Lima::FsaStringsPool * sp=0x00000000032708f0, Lima::LinguisticProcessing::FlatTokenizer::CharChart * charChart=0x00000000207a06b0)  Ligne 65 + 0x23 octets	C++
 	lima-lp-morphologicanalysis.dll!Lima::LinguisticProcessing::MorphologicAnalysis::SimpleWord::process(Lima::AnalysisContent & analysis={...})  Ligne 189 + 0xb5 octets	C++
 	lima-common-mediaprocessors.dll!Lima::ProcessUnitPipeline<Lima::MediaProcessUnit>::process(Lima::AnalysisContent & analysis={...})  Ligne 104 + 0x36 octets	C++
 	lima-lp-linguisticprocessing-core.dll!Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(const QString & texte={...}, const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > > & metaData=[10](("ElementName", "ara"),("FileName", "3967"),("Filename", "3967"),("Lang", "ara"),("StartOffset", "375"),("StartOffsetIndexingNode", "375"),("Type", ""),("docid", "3967"),("filePath", ""),("pipeline", "indexer")), const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & pipelineId="indexer", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,Lima::AbstractAnalysisHandler *,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,Lima::AbstractAnalysisHandler *> > > & handlers=[1](("xmlDocumentHandler", 0x000000000ef75e40 {m_out=0x0000000024b1c450 })), const std::set<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > & inactiveUnits=[0](), const Lima::StopAnalyze & stopAnalyze={...})  Ligne 230 + 0x1e octets	C++
 	lima-lp-linguisticprocessing-core.dll!Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & texte="...", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > > & metaData=[10](("ElementName", "ara"),("FileName", "3967"),("Filename", "3967"),("Lang", "ara"),("StartOffset", "375"),("StartOffsetIndexingNode", "375"),("Type", ""),("docid", "3967"),("filePath", ""),("pipeline", "indexer")), const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & pipelineId="indexer", const std::map<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,Lima::AbstractAnalysisHandler *,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,Lima::AbstractAnalysisHandler *> > > & handlers=[1](("xmlDocumentHandler", 0x000000000ef75e40 {m_out=0x0000000024b1c450 })), const std::set<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > & inactiveUnits=[0](), const Lima::StopAnalyze & stopAnalyze={...})  Ligne 85 + 0x4c octets	C++

BOCL.PDF

Memory leak in normalizeTerm tool

There is a significant memory leak in normalizeTerm, possibly related to d-pointers.
Here is the output of valgrind:

==7602== 6,000 bytes in 50 blocks are definitely lost in loss record 951 of 968
==7602==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7602==    by 0x94B1690: Lima::Node::Node() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602==    by 0x94B2D9A: std::_Rb_tree_iterator > std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_emplace_hint_unique, std::tuple<> >(std::_Rb_tree_const_iterator >, std::piecewise_construct_t const&, std::tuple&&, std::tuple<>&&) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602==    by 0x94B2795: Lima::Structure::addNode(Lima::Node const&) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-data.so.SOVERSION)
==7602==    by 0x5B0BFD1: Lima::LinguisticProcessing::BowTextHandler::endAnalysis() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-analysishandlers.so.2.0.1)
==7602==    by 0xAE33810: Lima::DumperStream::~DumperStream() (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-common-mediaprocessors.so.SOVERSION)
==7602==    by 0x65FFB60: Lima::LinguisticProcessing::AnalysisDumpers::BowDumper::process(Lima::AnalysisContent&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-analysisdumpers.so.2.0.1)
==7602==    by 0x50545AE: Lima::ProcessUnitPipeline::process(Lima::AnalysisContent&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-linguisticprocessing-core.so.2.0.1)
==7602==    by 0x5051035: Lima::LinguisticProcessing::CoreLinguisticProcessingClient::analyze(QString const&, std::map, std::allocator > > const&, std::string const&, std::map, std::allocator > > const&, std::set, std::allocator > const&) const (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/lib/liblima-lp-linguisticprocessing-core.so.2.0.1)
==7602==    by 0x40A1A8: dowork(int, char**) (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/bin/normalizeTerm)
==7602==    by 0x406DA1: main (in /data/Projets/AMOSE/Sources/amose-install/AMOSE/DistLima/bin/normalizeTerm)

Normalisation of real numbers does not work

Note: this issue completes issue #50 that was covering several problems including this one.

When analysing "123 45.6 . 12 345.6", we should get three number entities with the correct numeric values:

  • 123
  • 45.6
  • 12345.6

But we get (simplified):

  <type>Numex.NUMBER</type>
  <string>123 45.6</string>
  <numvalue>0</numvalue>

  <type>Numex.NUMBER</type>
  <string>12 345.6</string>
  <numvalue>0</numvalue>

The changes on branch https://github.com/aymara/lima/tree/AutomatonTransitionOnDouble try to handle the two problems of correctly recognizing the entities and correctly normalizing them. But for an unknown reason, the changes do not work as expected.
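The intended normalization can be illustrated outside LIMA. Below is a minimal sketch, assuming (as in the example above) that a space groups thousands and "." is the decimal separator; it is not the LIMA implementation.

```python
def normalize_number(s):
    """Collapse thousand-separator spaces, then parse the numeric value."""
    return float(s.replace(" ", ""))

for s in ["123", "45.6", "12 345.6"]:
    print(s, "->", normalize_number(s))
# 123 -> 123.0
# 45.6 -> 45.6
# 12 345.6 -> 12345.6
```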
