Comments (6)

drn05r commented on August 17, 2024

There look to be two issues here:

  1. The collation for eprint__rindex.word is utf8_bin. If this were utf8_general_ci, then searching for Bzdusek would return a result even if the eprint__rindex.word value is set to bzdušek.
  2. If the search term contains accented characters, then rather than filtering on bzdušek for eprint__rindex.word, it tries to filter on bzdu and ek. As the only indexed term is bzdušek and the filter matches exactly (not partially), no results are found.

So 1 can be fixed by hand, but some investigation is needed to ensure this table is created with utf8_general_ci collation rather than utf8_bin in future, so that this does not need to be manually fixed on new repositories. utf8_general_ci collation is already used for the eprint__rindex table itself, but not for word and some of the other fields it contains.

2 I expect will require a code change, because there must be something that treats accented characters as a separator rather than a valid character that may appear in a value for eprint__rindex.word.
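
To illustrate how point 2 could arise, here is a hypothetical sketch in Python (not the actual EPrints tokenizer, which is Perl). It assumes a tokenizer that only counts ASCII letters and digits as word characters, so an accented letter ends up acting as a separator. The function name `split_terms` is made up for this example.

```python
import re

def split_terms(text, ascii_only):
    """Split text into lowercase search terms.

    With ascii_only=True, only [A-Za-z0-9_] count as word characters,
    so an accented letter behaves like a separator.
    """
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text.lower(), flags)

# "\u0161" is the precomposed character š (U+0161).
print(split_terms("Bzdu\u0161ek", ascii_only=True))   # ['bzdu', 'ek']
print(split_terms("Bzdu\u0161ek", ascii_only=False))  # ['bzdušek']
```

With the ASCII-only rule, the search is run on bzdu and ek, neither of which matches the indexed term exactly, which is consistent with the behaviour described in point 2.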

from eprints3.4.

mpbraendle commented on August 17, 2024

I have debugged this to some extent.

Point 1 should be irrelevant: Index::Tokenizer::apply_mapping should ensure that only ASCII characters are stored in word.
For author names, it is called from MetaField::Name::get_search_conditions (for searching) and MetaField::Name::get_index_codes_basic. For some reason, the encoding seems not to be correct and apply_mapping doesn't transliterate.
Also, apply_mapping is able to transliterate multiple characters into one and also has special rules. I doubt that utf8_general_ci will do that.

Point 2 is not clear. If apply_mapping works as expected from MetaField::Name::get_index_codes_basic, that shouldn't be relevant either.

You are correct that a form of utf8_bin is used currently; in our case it is utf8mb3_bin:

show full columns from eprint__rindex;
+----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+
| Field    | Type         | Collation   | Null | Key | Default | Extra | Privileges                      | Comment |
+----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+
| eprintid | int(11)      | NULL        | NO   | PRI | 0       |       | select,insert,update,references |         |
| field    | varchar(64)  | utf8mb3_bin | NO   | PRI |         |       | select,insert,update,references |         |
| word     | varchar(128) | utf8mb3_bin | NO   | PRI |         |       | select,insert,update,references |         |
+----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+

ajmetz commented on August 17, 2024

There is a lot I still have to learn about EPrints, and indeed Unicode. I am wondering why character mapping is needed in the first place. Is it helpful to know that diacritic-insensitive (accent-insensitive) matching and comparison is possible with Perl's core and standard modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/
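
For reference, the general idea of accent-insensitive comparison can be sketched as follows. This is written in Python rather than Perl, purely as an illustration and not as a copy of the linked recipe: decompose each string to NFD, drop the combining marks, and compare the case-folded results. The helper name `fold_accents` is hypothetical.

```python
import unicodedata

def fold_accents(text):
    """Return text with combining marks removed and case folded."""
    # NFD splits a precomposed letter such as š into s + combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so this drops the accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(fold_accents("Bzdu\u0161ek") == fold_accents("bzdusek"))  # True
```

One limitation of this approach is that letters with no canonical decomposition, such as ø, pass through unchanged, which is one reason a mapping table can still be useful.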

mpbraendle commented on August 17, 2024

Here is a follow-up:

After two full days of debugging, trying out many variants and getting more gray hair, we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is accessed.
It behaves completely erratically: sometimes š is translated to s, sometimes not. It is as if the hash sometimes did not exist.
This problem is observed when characters with a Unicode code point > 0x00ff are used (that is, characters beyond the Latin-1 range).
It might be that a “use 5.8.0” would remedy this (not tried out) by using the old Unicode implementation of Perl.

However, we have now applied a solution that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names, and in several import plugins, which is much simpler and failsafe: Text::Unidecode

This library splits a character's code point into its upper and lower bytes and then addresses the transliteration tables, which are arrays, not hashes, by the respective integer values of those bytes.
Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all.
Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode
Also, I see that it’s part of the EPrints 3.3 package (but has been removed with EPrints 3.4).
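
The table lookup described above can be sketched roughly like this. It is a toy illustration in Python with two hand-filled tables, not the real Text::Unidecode data or code; the names `TABLES` and `toy_unidecode` are made up for the example.

```python
# Toy tables: one 256-entry list per high byte of the code point.
# Only a few entries are filled in here; "?" marks everything unmapped.
TABLES = {
    0x00: [chr(i) if i < 0x80 else "?" for i in range(256)],
    0x01: ["?"] * 256,
}
TABLES[0x01][0x60] = "S"  # U+0160 LATIN CAPITAL LETTER S WITH CARON
TABLES[0x01][0x61] = "s"  # U+0161 LATIN SMALL LETTER S WITH CARON

def toy_unidecode(text):
    """Transliterate by indexing tables with the code point's two bytes."""
    out = []
    for ch in text:
        code = ord(ch)
        high, low = code >> 8, code & 0xFF
        table = TABLES.get(high)  # no table for this block: unmapped
        out.append(table[low] if table is not None else "?")
    return "".join(out)

print(toy_unidecode("bzdu\u0161ek"))  # bzdusek
```

Because the lookup is plain integer indexing into arrays, it does not depend on how a hash key happens to be encoded, which may be why it avoids the erratic behaviour seen with the hash-based mapping.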

drn05r commented on August 17, 2024

I think Text::Unidecode was removed from the modules packaged with EPrints because it was felt liable to be updated on a regular basis, so it was not an ideal module to package within the codebase. Looking at https://metacpan.org/dist/Text-Unidecode/changes it does not look like it is actually updated that much, so that concern may be moot. However, the general practice of packaging Perl modules that are available in CPAN within the codebase seems like the wrong thing to do IMO. A better option is to work on ensuring https://github.com/eprints/eprints3.4/blob/HEAD/cpan_modules.pl is kept up to date, which is being looked at in #373.

I agree that maintaining EPrints' own $EPrints::Index::FREETEXT_CHAR_MAPPING seems like a time-consuming exercise that duplicates effort on which someone else has already done a more comprehensive job. I am not sure if this is something that can be dealt with for the next release, so I am going to punt this to be looked into in the next release, unless you can provide specific advice on how to amend the code suitably to move to using Text::Unidecode directly in the core codebase.

drn05r commented on August 17, 2024

I have taken a closer look at this, and inserting a unidecode call should negate the need for much of $EPrints::Index::FREETEXT_CHAR_MAPPING. However, it feels best to leave the mapping in for now, even though theoretically it should have minimal effect, as anything it would have remapped will already have been handled by Text::Unidecode::unidecode.
