Hi, in what way is specification of collation for non-English langua


Collation for non-English languages about snowstorm HOT 7 CLOSED

ihtsdo commented on August 10, 2024

Collation for non-English languages

from snowstorm.

Comments (7)

kaicode commented on August 10, 2024

Hi Daniel,

Currently the results of a term search in Snowstorm are sorted by length only, so shorter terms come back first. Alphabetical sorting is not implemented for any language, so sorting rules for languages other than English have not come into play.

Sorting by length seems to work well. Is this sufficient for you?

Kai
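As a minimal sketch of the length-only ordering described above (not Snowstorm's actual code; the function name and sample terms are made up for illustration):

```python
def sort_by_length(terms):
    """Return terms ordered shortest first; ties keep their input order."""
    # sorted() is stable, so equally long terms stay in their original order.
    return sorted(terms, key=len)


# Illustrative terms only.
results = sort_by_length(["Heart structure", "Heart", "Heart valve"])
```

Note that nothing here depends on language: length ordering needs no collation rules, which is why the question only becomes pressing once matching or alphabetical sorting is involved.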


danka74 commented on August 10, 2024

Hi Kai,
this is not so much about sorting (though that is relevant too) as it is about character matching. In Swedish, o and ö are distinct characters and should not match, whereas in German ö is just a variant (umlaut) of o, so there they do match. Each language captures this in its own collation rules.
So this is quite an important function for non-English languages, and Elasticsearch has basic support for it; see https://www.elastic.co/guide/en/elasticsearch/guide/master/character-folding.html
/Daniel
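To make the Swedish/German difference concrete, here is a rough sketch of language-dependent character folding in plain Python. The `PROTECTED` table and the `fold` function are assumptions made for illustration; they are not part of Snowstorm or Elasticsearch, and real collation rules cover far more than this.

```python
import unicodedata

# Hypothetical per-language sets of characters that must NOT be folded.
PROTECTED = {
    "sv": set("åäöÅÄÖ"),  # Swedish: å, ä and ö are letters in their own right
    "de": set(),          # German, per the comment above: ö may match o
}


def fold(text, lang):
    """Strip diacritics from text, except characters protected for lang."""
    keep = PROTECTED.get(lang, set())
    out = []
    # NFC first, so that each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # NFD splits a letter into base + combining marks; drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

With this table, `fold("Malmö", "sv")` leaves the string unchanged, while `fold("Köln", "de")` gives `"Koln"`. The same query character is therefore treated differently depending on the language in play, which is the point being made about collation.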


danka74 commented on August 10, 2024

For reference, I've added mongodb collations to the sct-snapshot-rest-api in this commit: danka74/sct-snapshot-rest-api@e704c32


danka74 commented on August 10, 2024

I did some testing with a local installation of Snowstorm. Currently it seems that strings are matched character for character, without any collation rules: searching for "magyar agar" returns no hits, whereas "magyar agår" returns 132436001 | Magyar Agår dog breed (organism) |. The snapshot-api, by contrast, uses hardcoded folding of characters (e.g. 'å' becomes 'a') if selected.
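The difference between the two behaviours can be sketched as follows. This is only an approximation using substring checks; `exact_match`, `folded_match` and `strip_marks` are hypothetical helpers, and neither Snowstorm nor the snapshot-api is claimed to work this way internally.

```python
import unicodedata


def strip_marks(text):
    """Lowercase text and drop combining marks, so 'å' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def exact_match(query, term):
    """Case-insensitive, but otherwise character-for-character."""
    return query.lower() in term.lower()


def folded_match(query, term):
    """Compare query and term after folding both to unaccented form."""
    return strip_marks(query) in strip_marks(term)


term = "Magyar Agår dog breed (organism)"
```

Here `exact_match("magyar agar", term)` is false and `exact_match("magyar agår", term)` is true, mirroring the test results reported above, while `folded_match("magyar agar", term)` is true, mirroring the hardcoded folding.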


kaicode commented on August 10, 2024

Hi @danka74, sorry for the slow response, I've been away.

Yes, the current behaviour is not to convert any special characters to a simpler form during search, but to match using the variant given. It sounds like this is not adequate for some languages, such as German. Thanks for the example, it helped me understand this.

Although we have the language code to hand when we index Description components, it may not be necessary to change the analyser at index time. The simplest approach may be to rely on the request language header and to use a different search analyser depending on the language being requested. If terms in German are being requested, both the exact characters in the search string and the folded version could be used to match descriptions. Matches against the original search characters should probably be given a higher search score. Would that work for you?
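The idea of matching on both the exact and the folded form, with exact matches ranked higher, could look roughly like this. It is a plain-Python sketch rather than an Elasticsearch query; the boost values, function names and sample terms are all assumptions chosen for illustration.

```python
import unicodedata

# Hypothetical weights: an exact-character match outranks a folded match.
EXACT_BOOST = 2.0
FOLDED_BOOST = 1.0


def strip_marks(text):
    """Lowercase text and drop combining marks, so 'ö' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def score(query, term):
    """Score a term: exact match first, folded match second, else 0."""
    if query.lower() in term.lower():
        return EXACT_BOOST
    if strip_marks(query) in strip_marks(term):
        return FOLDED_BOOST
    return 0.0


def search(query, terms):
    """Return matching terms, highest score first."""
    scored = [(score(query, t), t) for t in terms]
    hits = [(s, t) for s, t in scored if s > 0]
    return [t for s, t in sorted(hits, key=lambda pair: -pair[0])]


# Illustrative terms only.
terms = ["Kohler disease", "Köhler disease", "Asthma"]
```

Calling `search("köhler", terms)` returns both spellings with the exact one first, and `search("kohler", terms)` still finds "Köhler disease" through the folded form. For a language like Swedish, the folded branch would simply be switched off.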


kaicode commented on August 10, 2024

For the record: Daniel and I have started a branch to collaborate on this feature. We will experiment to find the best Elasticsearch settings. We have identified that it would be best to set the correct Elasticsearch language analyser at index time. Using the Description language code field during import / component creation to set the analyser is a possible solution. We will continue to work on this as time allows.
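One way to picture choosing the analyser from the Description language code at index time is the sketch below. The analyser names `english`, `swedish`, `german` and `standard` are built into Elasticsearch; everything else (the lookup table, the function names and the per-language `term_<code>` field) is hypothetical and not taken from the branch mentioned above.

```python
# Map language codes to Elasticsearch built-in analyser names.
ANALYZER_BY_LANGUAGE = {
    "en": "english",
    "sv": "swedish",
    "de": "german",
}
DEFAULT_ANALYZER = "standard"  # fallback for languages not listed above


def analyzer_for(language_code):
    """Pick an analyser name for a Description language code."""
    return ANALYZER_BY_LANGUAGE.get(language_code.lower(), DEFAULT_ANALYZER)


def term_field_mapping(language_code):
    """Build a mapping fragment for a hypothetical per-language term field."""
    code = language_code.lower()
    return {
        f"term_{code}": {
            "type": "text",
            "analyzer": analyzer_for(code),
        }
    }
```

For example, `term_field_mapping("sv")` yields a `term_sv` text field using the `swedish` analyser. The built-in language analysers also apply stemming and stop words, which may or may not suit terminology search, so whether they give the desired matching is something that would need testing.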


kaicode commented on August 10, 2024

Closing as duplicate of #41 which has had more recent chatter and is now fixed in dev.

