We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm


Index/Tokenizer problem (RHEL 8, perl 5.26) about eprints3.4 HOT 6 CLOSED

mpbraendle commented on August 17, 2024

Index/Tokenizer problem (RHEL 8, perl 5.26)

from eprints3.4.

Comments (6)

drn05r commented on August 17, 2024

There look to be two issues here:

1. The collation for eprint__rindex.word is utf8_bin. If it were utf8_general_ci, a search for Bzdusek would return a result even if the eprint__rindex.word value is set to bzdušek.
2. If the search term contains accented characters, then rather than filtering on bzdušek for eprint__rindex.word, it tries to filter on bzdu and ek. As the only indexed term is bzdušek and the filter matches exactly (not partially), no results are found.

So 1 can be fixed by hand, but some investigation is needed to ensure this table is created with utf8_general_ci collation rather than utf8_bin in future, so that it does not need to be fixed manually on new repositories. utf8_general_ci collation is already used for the eprint__rindex table itself, but not for word and some of the other fields it contains.

2, I expect, will require a code change, because there must be something that treats accented characters as a separator rather than as a valid character that may appear in a value for eprint__rindex.word.
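The splitting described in point 2 can be reproduced outside EPrints. The following is a minimal sketch in Python, not the EPrints Perl code: `ascii_tokenize` and `unicode_tokenize` are hypothetical helpers, and the assumption is that the tokenizer treats every character outside a-z and 0-9 as a separator.

```python
import re

def ascii_tokenize(text):
    # Hypothetical ASCII-only tokenizer: every character outside
    # a-z / 0-9 is treated as a separator, including "š".
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def unicode_tokenize(text):
    # Unicode-aware alternative: \w matches accented letters as well.
    return re.findall(r"\w+", text.lower())

indexed_words = {"bzdušek"}        # what the index holds in this scenario
query = "Bzdušek"

ascii_terms = ascii_tokenize(query)      # the name is split in two
unicode_terms = unicode_tokenize(query)  # the name stays whole

# Exact matching against the index fails for the split fragments.
print(ascii_terms, all(t in indexed_words for t in ascii_terms))
print(unicode_terms, all(t in indexed_words for t in unicode_terms))
```

With exact (not partial) matching, neither fragment equals the single indexed term, which matches the empty result described above.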


mpbraendle commented on August 17, 2024

I have debugged this to some extent.

Point 1 should be irrelevant: Index::Tokenizer::apply_mapping should ensure that only ASCII characters are stored in the word column.
For author names, it is called from MetaField::Name::get_search_conditions (for searching) and from MetaField::Name::get_index_codes_basic (for indexing). For some reason, the encoding seems to be incorrect and apply_mapping doesn't transliterate.
Also, apply_mapping is able to transliterate multiple characters into one and has special rules as well. I doubt that utf8_general_ci will do that.

Point 2 is not clear. If apply_mapping works as expected when called from MetaField::Name::get_index_codes_basic, it shouldn't be relevant either.
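The reasoning here is that a mapping applied identically at index time and at search time makes the database collation irrelevant, because both sides end up as the same ASCII string. A minimal sketch in Python, assuming a small hypothetical table `CHAR_MAPPING` and helper `map_chars` (these are not the contents or the signature of the EPrints mapping):

```python
# Hypothetical mapping entries, for illustration only.
CHAR_MAPPING = {"š": "s", "ß": "ss", "æ": "ae"}

def map_chars(text):
    # Lowercase, then replace each mapped character; a single input
    # character may expand to several ASCII characters.
    return "".join(CHAR_MAPPING.get(ch, ch) for ch in text.lower())

index_code = map_chars("Bzdušek")    # stored when the record is indexed
search_term = map_chars("bzdušek")   # computed when the user searches

print(index_code, search_term, index_code == search_term)
```

If the mapping silently fails on one side only, the two strings differ and a binary collation such as utf8_bin no longer matches them.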

You are correct, that a form of utf8_bin is used currently, in our case it is utf8mb3_bin:


ajmetz commented on August 17, 2024

There is a lot I still have to learn about EPrints, and indeed Unicode. I am wondering why character mapping is needed in the first place. Is it helpful to know that diacritic (accent) insensitive matching/comparison is possible with Perl core / Perl's standard modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/
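The idea behind accent-insensitive comparison can be sketched with standard-library Unicode normalisation. The example below is in Python rather than Perl and is only an approximation of what a collation-based comparison does; `fold` is a hypothetical helper name.

```python
import unicodedata

def fold(text):
    # Decompose accented letters into base letter + combining marks,
    # drop the combining marks, then fold case.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

print(fold("Bzdušek") == fold("bzdusek"))
```

One limitation of this approach: letters such as ø or ł have no canonical decomposition, so they pass through unchanged. Explicit transliteration tables cover such cases.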


mpbraendle commented on August 17, 2024

Here a follow-up:

After two full days of debugging, trying out many variants and getting more gray hair, we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is addressed.
It behaves completely erratically: sometimes š is translated to s, sometimes not. It is as if the hash sometimes did not exist.
This problem is observed when characters with a Unicode code point > 0x00ff are used (non-ASCII characters beyond Latin-1).
A “use 5.8.0” might remedy this (not tried out) by using the old Unicode implementation of perl.

However, we applied a solution now that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, which is much simpler and failsafe: Text::Unidecode

This library splits the code point of a character into its upper and lower bytes and then addresses the transliteration tables, which are arrays, not hashes, by the respective integer values.
Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all.
Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode
Also, I see that it’s part of the EPrints 3.3 package (but has been removed with EPrints 3.4).
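The array-based lookup described above can be sketched as follows. This is a Python illustration of the technique, not the Text::Unidecode source: the table contents are a tiny hypothetical excerpt, and `TABLES` and `transliterate` are assumed names.

```python
# One 256-entry array per block; only two entries are filled in here.
# A real transliteration library ships far more complete tables.
block_01 = [""] * 256
block_01[0x60] = "S"   # U+0160 LATIN CAPITAL LETTER S WITH CARON
block_01[0x61] = "s"   # U+0161 LATIN SMALL LETTER S WITH CARON

TABLES = {0x01: block_01}   # keyed by the upper byte of the code point

def transliterate(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)           # ASCII passes through unchanged
            continue
        table = TABLES.get(cp >> 8)  # upper byte selects the block
        # lower byte indexes into the block; unknown blocks yield ""
        out.append(table[cp & 0xFF] if table else "")
    return "".join(out)

print(transliterate("bzdušek"))
```

Because the lookup is pure integer indexing on the code point, it does not depend on how string keys are encoded, unlike a hash keyed by characters.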


drn05r commented on August 17, 2024

I think Text::Unidecode was removed from the EPrints package because it was felt liable to be updated on a regular basis, so it was not an ideal module to package within the codebase. Looking at https://metacpan.org/dist/Text-Unidecode/changes, it does not look like it is actually updated that much, so that concern may be moot. However, the general practice of packaging Perl modules that are available on CPAN in the codebase seems like the wrong thing to do IMO. A better option is to work on ensuring https://github.com/eprints/eprints3.4/blob/HEAD/cpan_modules.pl is kept up to date, which is being looked at in #373.

I agree that maintaining EPrints' own $EPrints::Index::FREETEXT_CHAR_MAPPING seems like a time-consuming exercise that duplicates effort on which someone else has already done a more comprehensive job. I am not sure this is something that can be dealt with for the next release, so I am going to punt it to be looked into in the release after, unless you can provide specific advice on how to amend the code suitably to move to using Text::Unidecode directly in the core codebase.


drn05r commented on August 17, 2024

I have taken a closer look at this, and inserting a unidecode call should negate the need for much of $EPrints::Index::FREETEXT_CHAR_MAPPING. However, it feels best to leave the mapping in for now, even though theoretically it should have minimal effect, as anything it would have remapped will already have been handled by Text::Unidecode::unidecode.
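The ordering described here (transliterate first, then apply the existing mapping) can be sketched as a two-stage pipeline. This is a Python illustration under stated assumptions, not the actual change: `LEGACY_MAPPING` holds hypothetical entries, and `transliterate` is a crude stand-in for unidecode built on standard-library normalisation.

```python
import unicodedata

# Hypothetical legacy entries; the real mapping is far larger.
LEGACY_MAPPING = {"š": "s", "ü": "u", "é": "e"}

def transliterate(text):
    # Stand-in for unidecode: decompose, then keep only ASCII characters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if ord(c) < 0x80)

def apply_legacy_mapping(text):
    return "".join(LEGACY_MAPPING.get(c, c) for c in text)

def normalise(text):
    # Stage 1 produces ASCII, so stage 2 finds nothing left to remap.
    return apply_legacy_mapping(transliterate(text))

word = "bzdušek"
print(normalise(word), normalise(word) == transliterate(word))
```

In this sketch the second stage is a no-op, which is why keeping the legacy mapping in place costs little while guarding against anything the first stage leaves untouched.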

