Comments (13)

danka74 avatar danka74 commented on August 10, 2024

Yes, Alejandro, this is the case, but it (likely) has to be addressed both when indexing and when querying. I have done some experiments in this branch: https://github.com/danka74/snowstorm/tree/swedish-experimental-dk. There, however, the indexing is hard-coded, which is not what we want; see this commit: danka74@701f082
/Daniel

from snowstorm.

alopezo avatar alopezo commented on August 10, 2024

Hi Daniel, that's exactly what is needed, you are right.
I wonder if a generic analyzer that “folds” all accented characters to plain ASCII would be good enough for different languages; it certainly would be for Spanish. We can provide a list of Spanish accented letters that could be added to that configuration.
Does this modification affect results for English in any way? Ideally this would become the standard way to index and search.

Thanks

danka74 avatar danka74 commented on August 10, 2024

@alopezo, this would unfortunately not work for Swedish, where the characters ÅÄÖ should not be folded. I see that in the few English words where Scandinavian characters are used (like Ångström, the length unit) SNOMED has used a folded term (here Angstrom), so maybe there is a "universal" set of characters which should be folded (e.g. É to E) and which excludes ÅÄÖ.
/Daniel
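
The difference between folding everything and folding everything except a protected set can be sketched in a few lines. This is only an illustration using Python's standard `unicodedata` module, not Snowstorm or Elasticsearch code; the function names `fold_all` and `fold_except` and the constant `KEEP` are hypothetical.

```python
import unicodedata


def fold_all(text: str) -> str:
    """Naive folding: decompose (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_except(text: str, keep: str) -> str:
    """Fold character by character, leaving any character listed in `keep` untouched."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch if ch in keep else fold_all(ch) for ch in composed)


# Assumption for this sketch: the Swedish letters named above are the protected set.
KEEP = "åäöÅÄÖ"

print(fold_all("Ångström"))           # Angstrom
print(fold_except("Ångström", KEEP))  # Ångström
print(fold_except("café", KEEP))      # cafe
```

Note that this approach only affects characters that decompose into a base letter plus a combining mark; letters such as ø and æ have no such decomposition and pass through unchanged either way.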

alopezo avatar alopezo commented on August 10, 2024

Defining that set would be a great first step, and much simpler to implement, as it would not require additional configuration at index/search time. For example, in Spanish we would like to fold:

áéíóúüñ

Maybe we can start with a short list of these and check use cases from other languages.

/Alejandro

kaicode avatar kaicode commented on August 10, 2024

@alopezo Elasticsearch has built-in support for appropriate character folding in each language. We plan to add a feature to Snowstorm that makes search work for all languages, with the correct language index analyser picked at index time using the description's language code field.

The correct analyser would also need to be used at search time in some cases. I'm still thinking about the best way to achieve this. Perhaps the Accept-Language header in the search request could be used to select a set of language specific search analysers?
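
One way the header idea could look, as a rough sketch only: parse the Accept-Language value into language codes ordered by preference, then look up an analyser per code. The `ANALYSERS` mapping and both function names are hypothetical and do not describe how Snowstorm does it; "swedish" and "spanish" are used here because they are names of built-in Elasticsearch language analyzers.

```python
from typing import Dict, List

# Hypothetical mapping from language code to search analyser name.
ANALYSERS: Dict[str, str] = {"sv": "swedish", "es": "spanish"}


def parse_accept_language(header: str) -> List[str]:
    """Return primary language codes ordered by descending q-value."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        params = params.strip()
        quality = 1.0  # a tag without an explicit q-value has full weight
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # malformed weight: ignore this entry
        code = tag.strip().lower().split("-")[0]  # "sv-SE" -> "sv"
        if code and code != "*" and quality > 0:
            ranked.append((-quality, position, code))
    ranked.sort()  # highest quality first, header order breaks ties
    codes: List[str] = []
    for _, _, code in ranked:
        if code not in codes:
            codes.append(code)
    return codes


def analysers_for(header: str) -> List[str]:
    """Pick the configured analysers for the languages the client accepts."""
    return [ANALYSERS[c] for c in parse_accept_language(header) if c in ANALYSERS]


print(analysers_for("sv-SE,sv;q=0.9,en;q=0.8"))  # ['swedish']
```

Languages without an entry in the mapping (English in this example) simply contribute no language-specific analyser, so a default would still be needed.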

alopezo avatar alopezo commented on August 10, 2024

Yes, this would be a good solution for us: accent folding for Spanish based on the Accept-Language header.

I'm reading the documentation of the language analyzers for Elasticsearch:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

They don't propose folding, but it seems like a straightforward thing to add.

Thanks!

kaicode avatar kaicode commented on August 10, 2024

I'm compiling the list of characters for each language which should not be folded/simplified because they are unique in that language.

In Swedish the characters which should not be folded are: åäö.
In Spanish I think the characters which should not be folded are: áéíóúüñ.

I'm making the assumption that all characters can be made lowercase during processing for search, regardless of diacritics, so we only need to capture the lowercase versions of each character which must not be folded.
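
The lowercase-then-fold idea with a per-language exception list might look roughly like this. It is a sketch under assumptions, not the Snowstorm implementation: the dictionary name `CHARACTERS_NOT_FOLDED` and the function `fold_for_search` are made up for illustration, and the character lists are simply the Swedish and Danish ones mentioned in this thread.

```python
import unicodedata

# Hypothetical per-language configuration; only lowercase forms are needed
# because the term is lowercased before folding.
CHARACTERS_NOT_FOLDED = {
    "sv": "åäö",
    "da": "æøå",
}


def fold_for_search(term: str, language: str) -> str:
    """Lowercase the term, then strip diacritics except those protected for `language`."""
    lowered = unicodedata.normalize("NFC", term.lower())
    keep = CHARACTERS_NOT_FOLDED.get(language, "")
    out = []
    for ch in lowered:
        if ch in keep:
            out.append(ch)  # protected letter: leave as-is
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold_for_search("Ångström", "sv"))  # ångström
print(fold_for_search("Ångström", "en"))  # angstrom
print(fold_for_search("Café", "sv"))      # cafe
```

For matching to work, the same function would have to be applied to the indexed term and to the search term, with the same language code on both sides.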

danka74 avatar danka74 commented on August 10, 2024

Some more:
Danish/Norwegian: å æ ø
Finnish: same as Swedish.

Perhaps a request to the Content Managers AG?
/Daniel

danka74 avatar danka74 commented on August 10, 2024

Posted a discussion item on the CMAG discussion page!

kaicode avatar kaicode commented on August 10, 2024

This feature is working so I'll close this ticket.
Only Swedish and Spanish characters are in configuration so far.
See "Search International Character Handling" in application.properties
Looking forward to adding more languages to configuration using another issue or pull request.

CWdanielsen avatar CWdanielsen commented on August 10, 2024

The order of the Danish letters is: æ ø å / Æ Ø Å.

kaicode avatar kaicode commented on August 10, 2024

Thanks @CWdanielsen, I've added these in the develop branch. They will go out in the next release.

CWdanielsen avatar CWdanielsen commented on August 10, 2024

Thanks Kai, and they are the last three letters in the DK alphabet after a-z/A-Z.
