natsort's Introduction

Natural Order String Comparison

Martin Pool http://sourcefrog.net

Computer string sorting algorithms generally don't order strings containing numbers the way a human would. Consider:

    rfc1.txt
    rfc2086.txt
    rfc822.txt

It would be more friendly if the program listed the files as

    rfc1.txt
    rfc822.txt
    rfc2086.txt

Filenames sort properly if people insert leading zeros, but they don't always do that.
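
As a small illustration of the problem and of the leading-zero workaround, here is a minimal Python sketch. It is not part of natsort; `pad_numbers` and its `width` parameter are hypothetical names chosen for this example, and the approach only works when no number is longer than `width` digits.

```python
import re

def pad_numbers(name, width=4):
    """Zero-pad every run of ASCII digits in `name` to `width` characters.

    Hypothetical helper for illustration only; it is not part of natsort.
    """
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), name)

names = ["rfc822.txt", "rfc1.txt", "rfc2086.txt"]

# Plain code-point ordering: '2' sorts before '8', so rfc2086 precedes rfc822.
plain = sorted(names)
print(plain)   # ['rfc1.txt', 'rfc2086.txt', 'rfc822.txt']

# Sorting on the zero-padded form gives the order a person would expect.
padded = sorted(names, key=pad_numbers)
print(padded)  # ['rfc1.txt', 'rfc822.txt', 'rfc2086.txt']
```

Padding needs a width chosen in advance, which is the limitation a natural-order comparison avoids.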

I've written a subroutine that compares strings according to this natural ordering. You can use this routine in your own software, or download a patch to add it to your favourite Unix program.

Sorting

Strings are sorted as usual, except that decimal integer substrings are compared on their numeric value. For example,

a < a0 < a1 < a1a < a1b < a2 < a10 < a20

Strings can contain several number parts:

x2-g8 < x2-y08 < x2-y7 < x8-y8

in which case numeric fields are separated by nonnumeric characters. Leading spaces are ignored. This works very well for IP addresses from log files, for example.

Leading zeros are not ignored, which tends to give more reasonable results on decimal fractions.

  1.001 < 1.002 < 1.010 < 1.02 < 1.1 < 1.3

Some applications may wish to change this by modifying the test that calls isspace.

Performance is linear: each character of the string is scanned at most once, and only as many characters as necessary to decide are considered.
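
The rules above can be sketched in Python as follows. This is an illustrative reimplementation written from the description in this section, not the C code in strnatcmp.c. The names `natural_compare`, `_run_end`, `_is_digit` and `_cmp` are made up for the example. It also makes two assumptions: whitespace is skipped wherever it occurs, and a digit run is treated as a fraction (compared left to right) whenever either run starts with a zero.

```python
from functools import cmp_to_key

def _is_digit(c):
    # ASCII digits only, matching the "only correct for ASCII" caveat.
    return "0" <= c <= "9"

def _run_end(s, i):
    """Return the index just past the run of digits starting at s[i]."""
    while i < len(s) and _is_digit(s[i]):
        i += 1
    return i

def _cmp(x, y):
    return (x > y) - (x < y)

def natural_compare(a, b):
    """Return -1, 0 or 1 as a sorts before, equal to, or after b."""
    i = j = 0
    while True:
        # Assumption: whitespace is skipped at every position.
        while i < len(a) and a[i].isspace():
            i += 1
        while j < len(b) and b[j].isspace():
            j += 1
        if i == len(a) or j == len(b):
            # Whichever string ran out first sorts first.
            return (i < len(a)) - (j < len(b))
        if _is_digit(a[i]) and _is_digit(b[j]):
            ei, ej = _run_end(a, i), _run_end(b, j)
            run_a, run_b = a[i:ei], b[j:ej]
            if run_a[0] == "0" or run_b[0] == "0":
                # Leading zero: compare digit by digit from the left.
                result = _cmp(run_a, run_b)
            else:
                # Integer: the longer run is larger; ties go digit by digit.
                result = _cmp(len(run_a), len(run_b)) or _cmp(run_a, run_b)
            if result:
                return result
            i, j = ei, ej
        elif a[i] != b[j]:
            return _cmp(a[i], b[j])
        else:
            i += 1
            j += 1

files = ["rfc1.txt", "rfc2086.txt", "rfc822.txt"]
print(sorted(files, key=cmp_to_key(natural_compare)))
# ['rfc1.txt', 'rfc822.txt', 'rfc2086.txt']
```

Comparing digit runs by length and then digit by digit avoids converting them to integers, so arbitrarily long numbers cannot overflow.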

Licensing

This software is copyright by Martin Pool, and made available under the same licence as zlib:

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.

  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.

  3. This notice may not be removed or altered from any source distribution.

This licence applies only to the C implementation. You are free to reimplement the idea from scratch in any language.

Get It!

strnatcmp.c, strnatcmp.h - the algorithm itself

natsort.c - example driver program.

natcompare.js - Kristof Coomans wrote a natural sort comparison in JavaScript.

natcmp.rb - An implementation by Alan Davies in Ruby.

Related Work

POSIX sort(1) has the -n option to sort numbers, but this doesn't work if there is a non-numeric prefix.

GNU ls(1) has the --sort=version option, which works the same way.

The PHP scripting language now has a strnatcmp function based on this code. The PHP wrapper was done by Andrei Zimievsky.

Stuart Cheshire has a Macintosh system extension to do natural ordering. I independently reinvented the algorithm, but Stuart had it first. I borrowed the term natural sort from him.

Sort::Versions in Perl. "The code has some special magic to deal with common conventions in program version numbers, like the difference between 'decimal' versions (eg perl 5.005) and the Unix kind (eg perl 5.6.1)."

Sort::Naturally is also in Perl, by Sean M. Burke. It uses locale-sensitive character classes to sort words and numeric substrings in a way similar to natsort.

Ed Avis wrote something similar in Haskell.

Pierre-Luc Paour wrote a NaturalOrderComparator in Java.

Numacomp - similar thing in Python.

as3natcompare implementation in Flash ActionScript 3.

To Do

Comparison of characters is purely numeric, without taking character set or locale into account, so it is only correct for ASCII. Locale-aware comparison should probably be a separate function, because it will probably introduce a dependency on the OS mechanism for finding the locale and comparing characters.

It might be good to support multibyte character sets too.

If you fix either of these, please mail me. They should not be very hard.

natsort's People

Contributors

aklomp, iepiweidieng, paour, sourcefrog, tueddy, zfergus

natsort's Issues

Comparison fails with 00

First of all thanks for your work.

We use it in MaterialAudiobookPlayer to compare file names. But recently a problem came up.

When a string starts with zeros, the ordering is wrong. I wrote a small unit test to demonstrate it:

String first = "00 I";
String second = "01 I";
assertTrue(NaturalOrderComparator.naturalCompare(first, second) < 0);

This one fails.

tests fail on trailing whitespace on osx

At 76b9abc:

% make test
ccache cc -Wall -g -Werror   -c -o strnatcmp.o strnatcmp.c
ccache cc -Wall -g -Werror   -c -o natsort.o natsort.c
ccache cc -o natsort strnatcmp.o natsort.o
--- sorted-words    2015-06-05 18:41:31.000000000 -0700
+++ /dev/fd/63  2015-08-22 11:41:54.000000000 -0700
@@ -13,8 +13,8 @@
 pic3
 pic4
 pic 4 else
-pic 5
 pic 5 
+pic 5
 pic 5 something
 pic 6
 pic   7
Test failed for input file 'test-words'
make: *** [test] Error 1

Fix issues found by `scan-build` static analysis

Clang's scan-build static analysis tool flags a few issues in natsort.c. These are easy enough to fix so that the analysis passes. This paves the way for running make analyze in CI as part of issue #8.

natsort.c:29: warning: "_GNU_SOURCE" redefined

While compiling your library with PlatformIO (ESP32) and these compiler flags:

build_flags =
    -std=c++17
    -std=gnu++17
    -Wall
    -Wextra
    -Wunreachable-code

I get some warnings:

.pio/libdeps/lolin_d32_pro_sdmmc_pe/natsort/natsort.c:29: warning: "_GNU_SOURCE" redefined
   29 | #define _GNU_SOURCE
      |
<command-line>: note: this is the location of the previous definition
.pio/libdeps/lolin_d32_pro_sdmmc_pe/natsort/natsort.c: In function 'main':
.pio/libdeps/lolin_d32_pro_sdmmc_pe/natsort/natsort.c:137:26: warning: implicit declaration of function 'getline' [-Wimplicit-function-declaration]
  137 |           if ((linelen = getline(&line, &bufsize, stdin)) <= 0)
      |                          ^~~~~~~

Is there any way to get rid of these warnings? Thanks!

Comparisons incorrect on leading zeros

The comment says:

The longest run of digits wins. That aside, the greatest
value wins, but we can't know that it will until we've scanned
both numbers to know that they have the same magnitude,
so we remember it in BIAS.

It should be "the longest run of digits, ignoring leading zeros, wins". E.g., given:

5
8
007

It should sort as 5, 007, 8 and not as 5, 8, 007.
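
The ordering requested here amounts to comparing digit runs by their numeric value. A minimal Python sketch of that alternative rule follows; `numeric_key` is a hypothetical helper written for this illustration, and it does not describe how strnatcmp.c currently behaves.

```python
import re

def numeric_key(s):
    """Split s into text and digit runs, turning each digit run into an int.

    Hypothetical helper: because the pattern has a capturing group,
    re.split keeps the digit runs at the odd indices of the result.
    """
    parts = re.split(r"([0-9]+)", s)
    return [int(p) if idx % 2 else p for idx, p in enumerate(parts)]

values = ["5", "8", "007"]
ordered = sorted(values, key=numeric_key)
print(ordered)  # ['5', '007', '8']
```

Under this rule "7" and "007" produce the same key, so their relative order is whatever the input order was.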

Please note that I'm responding just based on a quick look at the source code.

Case-sensitive js sort

This is more of a comment than a problem.

The algorithm in JavaScript is case-sensitive (which means output will be ["a", "c", "B"] instead of ["a", "B", "c"]).

One solution within the function is to lower-case both arguments at the very beginning of the declaration:
function natcompare(a, b) { a = a?.toLowerCase(); b = b?.toLowerCase(); var ia = 0, ib = 0; ...

Alternatively, call natcompare with lower-cased text: natcompare(a.toLowerCase(), b.toLowerCase())
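
The second suggestion, folding case outside the comparison function, can be sketched generically in Python. Both `fold_case` and `codepoint_compare` are hypothetical names for this example; `codepoint_compare` is only a stand-in for whatever comparison function is being wrapped, and it is not a natural-order comparison.

```python
from functools import cmp_to_key

def fold_case(compare):
    """Wrap a two-argument comparison so it sees lower-cased text."""
    def folded(a, b):
        return compare(a.lower(), b.lower())
    return folded

def codepoint_compare(a, b):
    # Stand-in comparator: plain code-point order, where "B" sorts before "a".
    return (a > b) - (a < b)

words = ["c", "B", "a"]
case_sensitive = sorted(words, key=cmp_to_key(codepoint_compare))
case_folded = sorted(words, key=cmp_to_key(fold_case(codepoint_compare)))
print(case_sensitive)  # ['B', 'a', 'c']
print(case_folded)     # ['a', 'B', 'c']
```

Wrapping leaves the comparison function itself untouched, so callers that want case-sensitive ordering can still use it directly.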
