czech-sort's Introduction

Czech Sort

This is a pure-Python library for Czech-language alphabetical sorting.

Quick Use

From Python:

>>> import czech_sort

>>> czech_sort.sorted(['sídliště', 'shoda', 'schody'])
['shoda', 'schody', 'sídliště']

>>> sorted(['sídliště', 'shoda', 'schody'], key=czech_sort.key)
['shoda', 'schody', 'sídliště']

On the command line:

$ python -m czech_sort < file.txt
shoda
schody
sídliště

Why another sorting library?

To sort Python strings in the Czech language, there are three other options:

  • Use PyICU. This can sort really well, and do all kinds of wonderful, standards-compliant Unicode things. Perfect for publication-quality results. Unfortunately, ICU can be a major pain to install, making it overkill if you just want to sort a list of strings.
  • Set the locale, then use locale.strxfrm. (Yes, strxfrm! Try saying that ten times fast!) This depends on the Czech POSIX locale being available, so it's hardly portable.
  • Just use Python's built-in string sort. This sorts lexicographically by Unicode codepoints. It might be good enough for you? Maybe?
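To make the trade-off concrete, here is a minimal sketch of the last two options using only the standard library. The helper names codepoint_sorted and locale_sorted are made up for this illustration, and the locale name cs_CZ.UTF-8 is an assumption: it differs between platforms and may not be installed at all, in which case the helper returns None.

```python
import locale


def codepoint_sorted(words):
    """Built-in sort: compares strings by Unicode code points."""
    return sorted(words)


def locale_sorted(words, locale_name="cs_CZ.UTF-8"):
    """Sort with locale.strxfrm, or return None if the locale is unavailable.

    The locale name is an assumption; it differs between platforms.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # the Czech locale is not available on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


words = ['sídliště', 'shoda', 'schody']
print(codepoint_sorted(words))  # ['schody', 'shoda', 'sídliště']
print(locale_sorted(words))     # depends on the locales installed
```

The built-in sort puts schody before shoda because it compares c with h one code point at a time, whereas Czech treats ch as a single letter that sorts after h.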

Scope

The czech-sort library is a compromise. It should give you good results in the 99% case.

Do not use this if you need proper sorting of symbols, non-Latin scripts, or diacritics other than Czech/Slovak.

Any other deviation from the relevant standard, ČSN 97 6030, should be considered a bug. However, neither the author nor the community at large has access to the standard, which makes finding such bugs somewhat difficult.

Full API

czech_sort.sorted(iterable)

Takes an iterable of strings, and returns a list of them, sorted.

czech_sort.key(s)

Returns a sort key object for a given string.

This function is suitable as the key for functions like the built-in sorted or list.sort.
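As a rough sketch of how a key function like this is typically used with structured data, the hypothetical helper sort_records below sorts dictionaries by one of their string fields. The helper, the sample records, and the field name are all made up for illustration. The key function is passed in as a parameter, and the demonstration uses str.casefold as a stand-in so that the sketch runs without czech-sort installed.

```python
def sort_records(records, field, key_func):
    """Return records sorted by the string stored under `field`.

    key_func maps a string to a sort key, for example czech_sort.key.
    """
    return sorted(records, key=lambda record: key_func(record[field]))


# Made-up sample data.
people = [
    {"name": "Hájek", "city": "Brno"},
    {"name": "Chalupa", "city": "Praha"},
    {"name": "Cibulka", "city": "Ostrava"},
]

# Stand-in key (not Czech-aware), used only so this sketch runs on its own.
for person in sort_records(people, "name", str.casefold):
    print(person["name"])
```

With the library installed, the call would be sort_records(people, "name", czech_sort.key). Given that ch sorts after h in Czech, that should place Chalupa after Hájek, unlike the stand-in key.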

czech_sort.bytes_key(s)

Returns a sort key for a given string, as bytes.

This is suitable for registering as a custom SQL function through a DB-API driver, for example with create_function on a built-in sqlite3 connection.

WARNING: Do not store the results of this function. The format can change in future versions of czech_sort.
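A minimal sketch of the SQLite use case, assuming an in-memory database. The helper fetch_sorted_words, the table layout, and the SQL function name czech_key are made up for this illustration. The key is computed at query time inside ORDER BY and is never written to the database, in line with the warning above.

```python
import sqlite3


def fetch_sorted_words(words, key_func):
    """Return words ordered by SQLite, using key_func as a custom SQL function.

    key_func must return a type SQLite accepts, such as bytes.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Register key_func under the made-up SQL name czech_key (1 argument).
        conn.create_function("czech_key", 1, key_func)
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
        # The key is computed per row at query time and never stored.
        rows = conn.execute("SELECT word FROM words ORDER BY czech_key(word)")
        return [row[0] for row in rows]
    finally:
        conn.close()
```

With the library installed, the call would be fetch_sorted_words(['sídliště', 'shoda', 'schody'], czech_sort.bytes_key). SQLite compares the returned bytes values as BLOBs, byte by byte.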

Installation

Install this into your virtualenv by running:

$ pip install czech-sort

Contribute

Bug reports and comments are welcome at GitHub.

Patches are also welcome! Source code is hosted at GitHub:

$ git clone http://github.com/encukou/czech-sort

To run the included tests:

$ python -m pip install -e ".[test]"
$ python -m pytest

If you would like to contribute, but are confused by the above, then please e-mail encukou at gmail dot com.

License

The project is licensed under the MIT license. May it serve you well.

czech-sort's People

Contributors

encukou, honzajavorek, jiri-one

czech-sort's Issues

Numbers sorting as last

I tested out the new version and it seems to sort Š correctly! However, it sorts numbers last. I have no idea whether this is expected behavior or not, because I don't know ČSN 97 6030, but I'm reporting this in case you have an idea. Feel free to just close this if this is how we Czechs actually sort 😅

Question: Using in SQLite

SQLite offers a way to easily add custom functions written in Python. I wondered if I could register czech_sort.key as a custom function, because then I could have Czech sort on any field in my database with ease and portability. However, SQLite doesn't seem to be happy with tuples, raising an exception:

sqlite3.ProgrammingError: User-defined functions cannot return 'tuple' values to SQLite

I looked at the tuple to see whether I could convert it to one of the supported types, but I got scared: the tuple is enormous and undocumented.

Any ideas how this could be done? Am I overlooking something, is this out of scope, or is it better done a different way?

Fails on character 'Ł' or 'Ø'

Sorting fails if the text contains 'Ł' or 'Ø':

File "/www/exhibition-backend/exhibitionenv/lib/python3.7/site-packages/czech_sort/impl.py", line 95, in key
char_lower, _extra_diacritics = DECOMPOSING_EXTRAS[char]
KeyError: 'Ł'

The problem is on line 95 of impl.py, which looks up DECOMPOSING_EXTRAS[char], while the condition above it checks if char_lower in DECOMPOSING_EXTRAS:.

I expect that char_lower should be used on line 95 as well: char_lower, _extra_diacritics = DECOMPOSING_EXTRAS[char_lower]
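The mismatch can be reproduced in isolation. This is only a sketch of the pattern described in the report, not the library's actual code: the table contents and the two helper functions are hypothetical stand-ins.

```python
# Hypothetical stand-in for the library's table; the real contents are not
# shown in this report.
DECOMPOSING_EXTRAS = {"ł": ("l", "stroke"), "ø": ("o", "stroke")}


def lookup_buggy(char):
    """Checks membership with the lowercased char, but indexes with the original."""
    char_lower = char.lower()
    if char_lower in DECOMPOSING_EXTRAS:
        return DECOMPOSING_EXTRAS[char]  # KeyError when char is uppercase
    return char_lower, None


def lookup_fixed(char):
    """Uses the same lowercased char for both the check and the lookup."""
    char_lower = char.lower()
    if char_lower in DECOMPOSING_EXTRAS:
        return DECOMPOSING_EXTRAS[char_lower]
    return char_lower, None


print(lookup_fixed("Ł"))  # ('l', 'stroke')
try:
    lookup_buggy("Ł")
except KeyError as error:
    print("KeyError:", error)
```

The lowercase 'ł' passes through both versions; only the uppercase form triggers the error, which matches the traceback above.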

ČSN public?

I'm reading this status and wondering whether it means that the ČSN standard mentioned in the README is now public and accessible. I haven't verified this; I'm just posting it here as a suggestion.
