Comments (13)
Yes, Alejandro, this is the case, but it (likely) has to be addressed both when indexing and when querying. I have done some experiments in this branch: https://github.com/danka74/snowstorm/tree/swedish-experimental-dk but there the indexing is hard-coded, which is not what we want; see this commit: danka74@701f082
/Daniel
from snowstorm.
Hi Daniel, that's exactly what is needed, you are right.
I wonder if a generic analyzer that "folds" all accented characters to plain ASCII would be good enough for different languages; it certainly would be for Spanish. We can provide a list of Spanish accented letters that may be added to that configuration.
Does this modification affect results for English in any way? It would be ideal to add this as the standard way to index and search.
Thanks
@alopezo, this would unfortunately not work for Swedish, where the characters ÅÄÖ should not be folded. I see that in the few English words where Scandinavian characters are used (like Ångström, the length unit) SNOMED has used a folded term (here Angstrom), so maybe there is a "universal" set of characters which should be folded (e.g. É to E) and which excludes ÅÄÖ.
/Daniel
Defining that set would be a great first step, and much simpler to implement, as it would not require additional configuration at index or search time. For example, in Spanish we would like to fold:
áéíóúüñ
Maybe we can start with a short list of these and check use cases from other languages.
/Alejandro
@alopezo Elasticsearch has built-in support for appropriate character folding in each language. We plan to add a feature to Snowstorm so that search works for all languages, with the correct language index analyser picked at index time using the description language code field.
The correct analyser would also need to be used at search time in some cases. I'm still thinking about the best way to achieve this. Perhaps the Accept-Language header in the search request could be used to select a set of language-specific search analysers?
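As a rough illustration of the Accept-Language idea (a sketch only, not Snowstorm's implementation), the header can be parsed into an ordered list of language codes, which are then used to look up per-language settings. The function names and the `CHARACTERS_NOT_FOLDED` table are hypothetical; its values are the characters listed in this thread, and combining the sets of all requested languages is an assumption.

```python
from typing import Dict, List

# Hypothetical per-language table, filled with characters mentioned in this thread.
CHARACTERS_NOT_FOLDED: Dict[str, str] = {
    "sv": "åäö",
    "fi": "åäö",
    "da": "æøå",
    "no": "æøå",
}


def languages_from_header(header: str) -> List[str]:
    """Return primary language codes from an Accept-Language value, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        primary = tag.strip().split("-")[0].lower()
        if primary and primary != "*":
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, primary))
    ordered: List[str] = []
    for _, _, code in sorted(weighted):
        if code not in ordered:
            ordered.append(code)
    return ordered


def characters_not_folded(header: str, config: Dict[str, str] = CHARACTERS_NOT_FOLDED) -> str:
    """Combine the protected characters of every requested language (assumed policy)."""
    kept = ""
    for code in languages_from_header(header):
        for ch in config.get(code, ""):
            if ch not in kept:
                kept += ch
    return kept
```

With this sketch a request carrying `Accept-Language: sv-SE,sv;q=0.9,en;q=0.8` would protect åäö at search time, while a request for `en` alone would protect nothing.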
Yes, this would be a good solution for us: accent folding for Spanish based on the Accept-Language header.
I'm reading the documentation of the language analyzers for Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
They don't propose folding, but it seems like a straightforward thing to add.
Thanks!
I'm compiling the list of characters for each language which should not be folded/simplified because they are unique to that language.
In Swedish the characters which should not be folded are: åäö.
In Spanish I think the characters which should not be folded are: áéíóúüñ.
I'm making the assumption that all characters can be made lowercase during processing for search, regardless of diacritics, so we only need to capture the lowercase versions of each character which must not be folded.
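A minimal sketch of that assumption in Python (not Snowstorm's actual code): lowercase the text first, then strip combining diacritics from every character except those in a per-language protected set. The function name `fold` and its `keep` parameter are hypothetical. Letters such as æ and ø have no Unicode decomposition, so this sketch leaves them unchanged whether or not they are protected.

```python
import unicodedata


def fold(text: str, keep: str = "") -> str:
    """Lowercase text and strip diacritics, except for characters listed in keep."""
    out = []
    # NFC first, so each accented letter is a single code point we can test against keep.
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch in keep:
            out.append(ch)  # protected character: leave untouched
            continue
        # NFD splits the letter into a base character plus combining marks; drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

Because the input is lowercased before the check, only the lowercase forms need to be listed: `fold("Ångström", keep="åäö")` keeps the Swedish letters, while `fold("Ångström")` with no protected set yields the fully folded form.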
Some more:
Danish/Norwegian: å æ ø
Finnish: same as Swedish.
Perhaps a request to the Content Managers AG?
/Daniel
Posted a discussion item on the CMAG discussion page!
This feature is working so I'll close this ticket.
Only Swedish and Spanish characters are in configuration so far.
See "Search International Character Handling" in application.properties
Looking forward to adding more languages to configuration using another issue or pull request.
The order of the Danish letters is: æ ø å / Æ Ø Å.
Thanks @CWdanielsen, I've added these in the develop branch. They will go out in the next release.
Thanks Kai, and they are the last three letters in the DK alphabet after a-z/A-Z.