Natural sorting: Difference between revisions

From Rosetta Code
Content added Content deleted
(+ D entry)
(Promote to full task from draft.)
Line 1: Line 1:
{{draft task}}
{{task}}
Natural sorting is the sorting of text that does more than rely on the
Natural sorting is the sorting of text that does more than rely on the
order of individual characters codes to make the finding of
order of individual characters codes to make the finding of

Revision as of 17:12, 16 April 2014

Task
Natural sorting
You are encouraged to solve this task according to the task description, using any language you may know.

Natural sorting is the sorting of text that does more than rely on the order of individual characters codes to make the finding of individual strings easier for a human reader.

There is no "one true way" to do this, but for the purpose of this task 'natural' orderings might include:

1. Ignore leading, trailing and multiple adjacent spaces
2. Make all whitespace characters equivalent.
3. Sorting without regard to case.
4. Sorting numeric portions of strings in numeric order. That is split the string into fields on numeric boundaries, then sort on each field, with the rightmost fields being the most significant, and numeric fields of integers treated as numbers.
foo9.txt before foo10.txt
As well as ... x9y99 before x9y100, before x10y0
... (for any number of groups of integers in a string).
5. Title sorts: without regard to a leading, very common, word such
as 'The' in "The thirty-nine steps".
6. Sort letters without regard to accents.
7. Sort ligatures as separate letters.
8. Replacements:
Sort german scharfes S (ß) as ss
Sort ſ, LATIN SMALL LETTER LONG S as s
Sort ʒ, LATIN SMALL LETTER EZH as s
...
Task Description
  • Implement the first four of the eight given features in a natural sorting routine/function/method...
  • Test each feature implemented separately with an ordered list of test strings from the 'Sample inputs' section below, and make sure your naturally sorted output is in the same order as other language outputs such as Python.
  • Print and display your output.
  • For extra credit implement more than the first four.

Note: It is not necessary to have individual control of which features are active in the natural sorting routine at any time.

Sample input
# Ignoring leading spaces
Text strings:
['ignore leading spaces: 2-2', ' ignore leading spaces: 2-1', '  ignore leading spaces: 2+0', '   ignore leading spaces: 2+1']

# Ignoring multiple adjacent spaces (m.a.s)
Text strings:
['ignore m.a.s spaces: 2-2', 'ignore m.a.s  spaces: 2-1', 'ignore m.a.s   spaces: 2+0', 'ignore m.a.s    spaces: 2+1']


# Equivalent whitespace characters
Text strings:
['Equiv. spaces: 3-3', 'Equiv.\rspaces: 3-2', 'Equiv.\x0cspaces: 3-1', 'Equiv.\x0bspaces: 3+0', 'Equiv.\nspaces: 3+1', 'Equiv.\tspaces: 3+2']

# Case Indepenent sort
Text strings:
['cASE INDEPENENT: 3-2', 'caSE INDEPENENT: 3-1', 'casE INDEPENENT: 3+0', 'case INDEPENENT: 3+1']

# Numeric fields as numerics
Text strings:
['foo100bar99baz0.txt', 'foo100bar10baz0.txt', 'foo1000bar99baz10.txt', 'foo1000bar99baz9.txt']

# Title sorts
Text strings:
['The Wind in the Willows', 'The 40th step more', 'The 39 steps', 'Wanda']

# Equivalent accented characters (and case)
Text strings:
[u'Equiv. \xfd accents: 2-2', u'Equiv. \xdd accents: 2-1', u'Equiv. y accents: 2+0', u'Equiv. Y accents: 2+1']


# Separated ligatures
Text strings:
[u'\u0132 ligatured ij', 'no ligature']

# Character replacements
Text strings:
[u'Start with an \u0292: 2-2', u'Start with an \u017f: 2-1', u'Start with an \xdf: 2+0', u'Start with an s: 2+1']


C

Some differences from task requirement:

  1. req 1 and 2 are not separated. I can't imagine a situation where I'd want one but not the other.
  2. req 5 is implemented differently: not only leading "the", but some common words like "it", "to" etc are discarded everywhere in the string
  3. req 6, 7, 8 are pretty much the same thing, so the implementation doesn't make too much of a distinction of it ("ß" is a ligature, you know.)

Besides the numeric part, everything else was done in a uniform way by transforming input strings into some normalized format and comparing those instead. All sort options flags can be freely mixed together. C source is written in UTF-8 for easier reading here: don't do this for serious code. <lang c>#include <stdio.h>

  1. include <stdlib.h>
  2. include <wchar.h>
  3. include <wctype.h>
  4. include <string.h>
  5. include <locale.h>

typedef struct wstr { wchar_t *s; int n, alloc; } wstr;

  1. define w_del(w) { free(w->s); free(w); }
  2. define forchars(i, c, w) for(i = 0, c = w->s[0]; i < w->n && c; c = w->s[++i])

wstr *w_new() { wstr *w = malloc(sizeof(wstr)); w->alloc = 1; w->n = 0; w->s = malloc(sizeof(wchar_t)); w->s[0] = 0; return w; }

void w_append(wstr *w, wchar_t c) { int n = w->n + 1; if (n >= w->alloc) { w->alloc *= 2; w->s = realloc(w->s, w->alloc * sizeof(wchar_t)); } w->s[w->n++] = c; w->s[w->n] = 0; }

wstr *w_make(wchar_t *s) { int i, len = wcslen(s); wstr *w = w_new(); for (i = 0; i < len; i++) w_append(w, s[i]); return w; }

typedef void (*wtrans_func)(wstr *, wstr *); void w_transform(wstr *in, wtrans_func f) { wstr t, *out = w_new(); f(in, out); t = *in; *in = *out; *out = t; w_del(out); }

  1. define transfunc(x) void w_##x(wstr *in, wstr *out)

transfunc(nocase) { int i; wchar_t c; forchars(i, c, in) w_append(out, towlower(c)); }

transfunc(despace) { int i, gotspace = 0; wchar_t c; forchars(i, c, in) { if (!iswspace(c)) { if (gotspace && out->n) w_append(out, L' '); w_append(out, c); gotspace = 0; } else gotspace = 1; } }

static const wchar_t *const tbl_accent[] = { /* copied from Perl6 code */ L"Þ", L"TH", L"þ", L"th", L"Ð", L"TH", L"ð", L"th", L"À", L"A", L"Á", L"A", L"Â", L"A", L"Ã", L"A", L"Ä", L"A", L"Å", L"A", L"à", L"a", L"á", L"a", L"â", L"a", L"ã", L"a", L"ä", L"a", L"å", L"a", L"Ç", L"C", L"ç", L"c", L"È", L"E", L"É", L"E", L"Ê", L"E", L"Ë", L"E", L"è", L"e", L"é", L"e", L"ê", L"e", L"ë", L"e", L"Ì", L"I", L"Í", L"I", L"Î", L"I", L"Ï", L"I", L"ì", L"i", L"í", L"i", L"î", L"i", L"ï", L"i", L"Ò", L"O", L"Ó", L"O", L"Ô", L"O", L"Õ", L"O", L"Ö", L"O", L"Ø", L"O", L"ò", L"o", L"ó", L"o", L"ô", L"o", L"õ", L"o", L"ö", L"o", L"ø", L"o", L"Ñ", L"N", L"ñ", L"n", L"Ù", L"U", L"Ú", L"U", L"Û", L"U", L"Ü", L"U", L"ù", L"u", L"ú", L"u", L"û", L"u", L"ü", L"u", L"Ý", L"Y", L"ÿ", L"y", L"ý", L"y" };

static const wchar_t *const tbl_ligature[] = { L"Æ", L"AE", L"æ", L"ae", L"ß", L"ss", L"ffl", L"ffl", L"ffi", L"ffi", L"fi", L"fi", L"ff", L"ff", L"fl", L"fl", L"ſ", L"s", L"ʒ", L"z", L"st", L"st", /* ... come on ... */ };

void w_char_repl(wstr *in, wstr *out, const wchar_t *const *tbl, int len) { int i, j, k; wchar_t c; forchars(i, c, in) { for (j = k = 0; j < len; j += 2) { if (c != tbl[j][0]) continue; for (k = 0; tbl[j + 1][k]; k++) w_append(out, tbl[j + 1][k]); break; } if (!k) w_append(out, c); } }

transfunc(noaccent) { w_char_repl(in, out, tbl_accent, sizeof(tbl_accent)/sizeof(wchar_t*)); }

transfunc(noligature) { w_char_repl(in, out, tbl_ligature, sizeof(tbl_ligature)/sizeof(wchar_t*)); }

static const wchar_t *const tbl_article[] = { L"the", L"a", L"of", L"to", L"is", L"it" };

  1. define N_ARTICLES sizeof(tbl_article)/sizeof(tbl_article[0])

transfunc(noarticle) { int i, j, n; wchar_t c, c0 = 0; forchars(i, c, in) { if (!c0 || (iswalnum(c) && !iswalnum(c0))) { /* word boundary */ for (j = N_ARTICLES - 1; j >= 0; j--) { n = wcslen(tbl_article[j]); if (wcsncasecmp(in->s + i, tbl_article[j], n)) continue; if (iswalnum(in->s[i + n])) continue; i += n; break; } if (j < 0) w_append(out, c); } else w_append(out, c); c0 = c; } }

enum { wi_space = 0, wi_case, wi_accent, wi_lig, wi_article, wi_numeric };

  1. define WS_NOSPACE (1 << wi_space)
  2. define WS_NOCASE (1 << wi_case)
  3. define WS_ACCENT (1 << wi_accent)
  4. define WS_LIGATURE (1 << wi_lig)
  5. define WS_NOARTICLE (1 << wi_article)
  6. define WS_NUMERIC (1 << wi_numeric)

const wtrans_func trans_funcs[] = { w_despace, w_nocase, w_noaccent, w_noligature, w_noarticle, 0 }; const char *const flagnames[] = { "collapse spaces", "case insensitive", "disregard accent", "decompose ligatures", "discard common words", "numeric", };

typedef struct { wchar_t* s; wstr *w; } kw_t; int w_numcmp(const void *a, const void *b) { wchar_t *pa = ((const kw_t*)a)->w->s, *pb = ((const kw_t*)b)->w->s; int sa, sb, ea, eb; while (*pa && *pb) { if (iswdigit(*pa) && iswdigit(*pb)) { /* skip leading zeros */ sa = sb = 0; while (pa[sa] == L'0') sa++; while (pb[sb] == L'0') sb++; /* find end of numbers */ ea = sa; eb = sb; while (iswdigit(pa[ea])) ea++; while (iswdigit(pb[eb])) eb++; if (eb - sb > ea - sa) return -1; if (eb - sb < ea - sa) return 1; while (sb < eb) { if (pa[sa] > pb[sb]) return 1; if (pa[sa] < pb[sb]) return -1; sa++; sb++; }

pa += ea; pb += eb; } else if (iswdigit(*pa)) return 1; else if (iswdigit(*pb)) return -1; else { if (*pa > *pb) return 1; if (*pa < *pb) return -1; pa++; pb++; } } return (!*pa && !*pb) ? 0 : *pa ? 1 : -1; }

int w_cmp(const void *a, const void *b) { return wcscmp(((const kw_t*)a)->w->s, ((const kw_t*)b)->w->s); }

void natural_sort(wchar_t **strings, int len, int flags) { int i, j; kw_t *kws = malloc(sizeof(kw_t) * len);

for (i = 0; i < len; i++) { kws[i].s = strings[i]; kws[i].w = w_make(strings[i]); for (j = 0; j < wi_numeric; j++) if (flags & (1 << j) && trans_funcs[j]) w_transform(kws[i].w, trans_funcs[j]); }

qsort(kws, len, sizeof(kw_t), (flags & WS_NUMERIC) ? w_numcmp : w_cmp); for (i = 0; i < len; i++) { w_del(kws[i].w); strings[i] = kws[i].s; } free(kws); }

const wchar_t *const test[] = { L" 0000098 nina", L"100 niño", L"99 Ninja", L"100 NINA", L" The work is so difficult to do it took ſome 100 aeons. ", L"The work is so difficult it took some 100 aeons.", L" The work is so difficult it took ſome 99 æons. ", };

  1. define N_STRINGS sizeof(test)/sizeof(*test)

void test_sort(int flags) { int i, j; const wchar_t *str[N_STRINGS]; memcpy(str, test, sizeof(test));

printf("Sort flags: ("); for (i = 0, j = flags; j; i++, j >>= 1) if ((j & 1)) printf("%s%s", flagnames[i], j > 1 ? ", ":")\n");

natural_sort((wchar_t **)str, N_STRINGS, flags);

for (i = 0; i < N_STRINGS; i++) printf("%ls\n", str[i]); printf("\n"); }

int main() { setlocale(LC_CTYPE, "");

test_sort(WS_NOSPACE); test_sort(WS_NOCASE); test_sort(WS_NUMERIC); test_sort(WS_NOARTICLE|WS_NOSPACE); test_sort(WS_NOCASE|WS_NOSPACE|WS_ACCENT); test_sort(WS_LIGATURE|WS_NOCASE|WS_NOSPACE|WS_NUMERIC|WS_ACCENT|WS_NOARTICLE);

return 0; }</lang>output<lang>Sort flags: (collapse spaces)

0000098 nina

100 NINA 100 niño 99 Ninja The work is so difficult it took some 100 aeons.

 The work is so difficult   it took ſome 99 æons.  
The work is so difficult to do it took ſome 100 aeons.  

Sort flags: (case insensitive)

 The work is so difficult   it took ſome 99 æons.  
0000098 nina
The work is so difficult to do it took ſome 100 aeons.  

100 NINA 100 niño 99 Ninja The work is so difficult it took some 100 aeons.

Sort flags: (numeric)

 The work is so difficult   it took ſome 99 æons.  
The work is so difficult to do it took ſome 100 aeons.  
0000098 nina

The work is so difficult it took some 100 aeons. 99 Ninja 100 NINA 100 niño

Sort flags: (collapse spaces, discard common words)

0000098 nina

100 NINA 100 niño 99 Ninja The work is so difficult it took some 100 aeons.

The work is so difficult to do it took ſome 100 aeons.  
 The work is so difficult   it took ſome 99 æons.  

Sort flags: (collapse spaces, case insensitive, disregard accent)

0000098 nina

100 NINA 100 niño 99 Ninja The work is so difficult it took some 100 aeons.

 The work is so difficult   it took ſome 99 æons.  
The work is so difficult to do it took ſome 100 aeons.  

Sort flags: (collapse spaces, case insensitive, disregard accent, decompose ligatures, discard common words, numeric)

The work is so difficult to do it took ſome 100 aeons.  
 The work is so difficult   it took ſome 99 æons.  

The work is so difficult it took some 100 aeons.

0000098 nina

99 Ninja 100 NINA 100 niño</lang>

D

Implements requests 1-5. <lang d>import std.stdio, std.string, std.algorithm, std.array, std.conv,

      std.ascii, std.range;

string[] naturalSort(string[] arr) {

   static struct Part {
       string s;
       bool isNumber = false;
       ulong n;
       int opCmp(in ref Part other) const pure {
           return (isNumber && other.isNumber) ?
                  cmp([n], [other.n]) : cmp(s, other.s);
       }
   }
   static Part[] mapper(in string s) pure {
       // groupBy!isDigit could shorten this function.
       enum Kind { nothing, digit, notDigit }
       auto lk = Kind.nothing; // Last Kind.
       string[] parts;
       foreach (c; s.strip.tr(whitespace, " ", "s").toLower) {
           if (lk != Kind.nothing && c.isDigit == (lk == Kind.digit))
               parts.back ~= c;
           else
               parts ~= [c];
           lk = c.isDigit ? Kind.digit : Kind.notDigit;
       }
       auto r = parts.map!(p => p[0].isDigit ?
                           Part(p, true, p.to!ulong) :
                           Part(p)).array;
       return (r.length > 1 && r[0].s == "the") ? r.dropOne : r;
   }
   return arr.schwartzSort!mapper.release;

}

void main() {

   auto tests = [
   // Ignoring leading spaces.
   ["ignore leading spaces: 2-2", " ignore leading spaces: 2-1", "
    ignore leading spaces: 2+1", "  ignore leading spaces: 2+0"],
   // Ignoring multiple adjacent spaces (m.a.s).
   ["ignore m.a.s spaces: 2-2", "ignore m.a.s  spaces: 2-1",
    "ignore m.a.s   spaces: 2+0", "ignore m.a.s    spaces: 2+1"],
   // Equivalent whitespace characters.
   ["Equiv. spaces: 3-3", "Equiv.\rspaces: 3-2",
    "Equiv.\x0cspaces: 3-1", "Equiv.\x0bspaces: 3+0",
    "Equiv.\nspaces: 3+1", "Equiv.\tspaces: 3+2"],
   // Case Indepenent sort.
   ["cASE INDEPENENT: 3-2", "caSE INDEPENENT: 3-1",
    "casE INDEPENENT: 3+0", "case INDEPENENT: 3+1"],
   // Numeric fields as numerics.
   ["foo100bar99baz0.txt", "foo100bar10baz0.txt",
    "foo1000bar99baz10.txt", "foo1000bar99baz9.txt"],
   // Title sorts.
   ["The Wind in the Willows", "The 40th step more",
    "The 39 steps", "Wanda"]];
   foreach (test; tests) {
       test.writeln;
       writeln(test.naturalSort, "\n");
   }

}</lang>

Output:
["ignore leading spaces: 2-2", " ignore leading spaces: 2-1", "   ignore leading spaces: 2+1", "  ignore leading spaces: 2+0"]
["  ignore leading spaces: 2+0", "   ignore leading spaces: 2+1", " ignore leading spaces: 2-1", "ignore leading spaces: 2-2"]

["ignore m.a.s spaces: 2-2", "ignore m.a.s  spaces: 2-1", "ignore m.a.s   spaces: 2+0", "ignore m.a.s    spaces: 2+1"]
["ignore m.a.s   spaces: 2+0", "ignore m.a.s    spaces: 2+1", "ignore m.a.s  spaces: 2-1", "ignore m.a.s spaces: 2-2"]

["Equiv. spaces: 3-3", "Equiv.\rspaces: 3-2", "Equiv.\fspaces: 3-1", "Equiv.\vspaces: 3+0", "Equiv.\nspaces: 3+1", "Equiv.\tspaces: 3+2"]
["Equiv.\vspaces: 3+0", "Equiv.\nspaces: 3+1", "Equiv.\tspaces: 3+2", "Equiv.\fspaces: 3-1", "Equiv.\rspaces: 3-2", "Equiv. spaces: 3-3"]

["cASE INDEPENENT: 3-2", "caSE INDEPENENT: 3-1", "casE INDEPENENT: 3+0", "case INDEPENENT: 3+1"]
["casE INDEPENENT: 3+0", "case INDEPENENT: 3+1", "caSE INDEPENENT: 3-1", "cASE INDEPENENT: 3-2"]

["foo100bar99baz0.txt", "foo100bar10baz0.txt", "foo1000bar99baz10.txt", "foo1000bar99baz9.txt"]
["foo100bar10baz0.txt", "foo100bar99baz0.txt", "foo1000bar99baz9.txt", "foo1000bar99baz10.txt"]

["The Wind in the Willows", "The 40th step more", "The 39 steps", "Wanda"]
["The 39 steps", "The 40th step more", "The Wind in the Willows", "Wanda"]

J

The natural way of approaching this task in J is to normalize the text based on the rules desired for sorting. Here, we limit ourselves to ascii, for portability, and decide that our domain shall be terminated strings (where each string ends with the same character):

<lang j>require'strings regex'

lines=: <;.2 titleFix=: ('^\s*(the|a|an)\b';)&rxrplc isNum=: e.&'0123456789' num=: ".^:(isNum@{.) split=: <@num/.~ [:+/\1,2 ~:/\ isNum norm=: [: split (32 9 12 13 14 15{a.) -.~ [: titleFix tolower

natSor=: lines ;@/: norm&.>@lines</lang>

Example data:

<lang j>IgnoringLeadingSpaces=:0 :0 ignore leading spaces: 2-2

ignore leading spaces: 2-1
 ignore leading spaces: 2+0
  ignore leading spaces: 2+1

)

IgnoringMultipleAdjacentSpaces=: 0 :0

ignore m.a.s spaces: 2-2
ignore m.a.s  spaces: 2-1
ignore m.a.s   spaces: 2+0
ignore m.a.s    spaces: 2+1

)

bsSubst=: rplc&((LF;'\'),('\r';13{a.),('\x0c';12{a.),('\x0b';11{a.),('\n';LF),:'\t';TAB)

EquivalentWhitespaceCharacters=: bsSubst 0 :0

Equiv. spaces: 3-3
Equiv.\rspaces: 3-2
Equiv.\x0cspaces: 3-1
Equiv.\x0bspaces: 3+0
Equiv.\nspaces: 3+1
Equiv.\tspaces: 3+2

)

CaseIndepenent=: 0 :0

cASE INDEPENENT: 3-2
caSE INDEPENENT: 3-1
casE INDEPENENT: 3+0
case INDEPENENT: 3+1

)

NumericFieldsAsNumerics=: 0 :0

foo100bar99baz0.txt
foo100bar10baz0.txt
foo1000bar99baz10.txt
foo1000bar99baz9.txt

)

Titles=: 0 :0

The Wind in the Willows
The 40th step more
The 39 steps
Wanda

)</lang>

Note that the required example which contains equivalent whitespace characters includes a '\n' in the data. So, for that example, we use a backslash as our terminator.

Example use:

<lang j> natSor IgnoringLeadingSpaces

 ignore leading spaces: 2+0
  ignore leading spaces: 2+1
ignore leading spaces: 2-1

ignore leading spaces: 2-2

  natSor IgnoringMultipleAdjacentSpaces
ignore m.a.s   spaces: 2+0
ignore m.a.s    spaces: 2+1
ignore m.a.s  spaces: 2-1
ignore m.a.s spaces: 2-2
  natSor EquivalentWhitespaceCharacters
Equiv.

spaces: 3+1\ Equiv.�spaces: 3+0\ Equiv. spaces: 3+2\ Equiv.�spaces: 3-1\ Equiv. spaces: 3-2\ Equiv. spaces: 3-3\

  natSor CaseIndepenent
casE INDEPENENT: 3+0
case INDEPENENT: 3+1
caSE INDEPENENT: 3-1
cASE INDEPENENT: 3-2
  natSor NumericFieldsAsNumerics
foo100bar10baz0.txt
foo100bar99baz0.txt
foo1000bar99baz9.txt
foo1000bar99baz10.txt
  natSor Titles
The 39 steps
The 40th step more
Wanda
The Wind in the Willows

</lang>

Perl 6

Tested with Rakudo 2011.06

In Perl6 it is very easy to modify the default sorting order by passing in a transform routine to the sort function. If the transform routines are arity one, the sort function will apply a Schwartzian Transform so it only needs to calculate the transform once. Note that the transforms are non-destructive; The sort function returns the original strings.

The following are a series of subroutines to perform the various natural sorting transforms. They may be applied individually or mixed and matched to get the particular result desired. When more than one is strung together, they apply left to right. Some combinations may yield different results depending on the order they are applied.

<lang perl6># Sort groups of digits in number order. Sort by order of magnitude then lexically. sub naturally ($a) { $a.lc.subst(/(\d+)/, ->$/ {0~$0.chars.chr~$0},:g) ~"\x0"~$a }

  1. Collapse multiple ws characters to a single.

sub collapse ($a) { $a.subst( / ( \s ) $0+ /, -> $/ { $0 }, :g ) }

  1. Convert all ws characters to a space.

sub normalize ($a) { $a.subst( / ( \s ) /, ' ', :g ) }

  1. Ignore common leading articles for title sorts

sub title ($a) { $a.subst( / :i ^ ( a | an | the ) >> \s* /, ) }

  1. Decompose ISO-Latin1 glyphs to their base character.

sub latin1_decompose ($a) {

   my %tr = < 
      Æ AE æ ae Þ TH þ th Ð TH ð th ß ss À A Á A Â A Ã A Ä A Å A à a á a
       â a ã a ä a å a Ç C ç c È E É E Ê E Ë E è e é e ê e ë e Ì I Í I Î
       I Ï I ì i í i î i ï i Ò O Ó O Ô O Õ O Ö O Ø O ò o ó o ô o õ o ö o
       ø o Ñ N ñ n Ù U Ú U Û U Ü U ù u ú u û u ü u Ý Y ÿ y ý y
   >;
  1. Would probably be better implemented as
  2. $a.trans( [%tr.keys] => [%tr.values] );
  3. but the underlying Parrot VM leaks through in current Rakudo.
   my $re = '<[' ~ %tr.keys.join('|') ~ ']>';
   $a.subst(/ (<$re>) /, -> $/ { %tr{$0} }, :g)

}</lang>


Used as:

<lang perl6>my @tests = (

   [
       "Task 1a\nSort while ignoring leading spaces.",
       [
         'ignore leading spaces: 1', '   ignore leading spaces: 4',
         '  ignore leading spaces: 3', ' ignore leading spaces: 2'
       ],
       {.trim} # builtin method.
   ],
   [
       "Task 1b\nSort while ignoring multiple adjacent spaces.",
       [
         'ignore m.a.s   spaces: 3', 'ignore m.a.s spaces: 1',
         'ignore m.a.s    spaces: 4', 'ignore m.a.s  spaces: 2'
       ],
       {.&collapse}
   ],
   [
       "Task 2\nSort with all white space normalized to regular spaces.",
       [
         "Normalized\tspaces: 4", "Normalized\xa0spaces: 1",
         "Normalized\x20spaces: 2", "Normalized\nspaces: 3"
       ],
       {.&normalize}
   ],
   [
       "Task 3\nSort case independently.",
       [
         'caSE INDEPENDENT: 3', 'casE INDEPENDENT: 2',
         'cASE INDEPENDENT: 4', 'case INDEPENDENT: 1'
       ],
       {.lc} # builtin method
   ],
   [
       "Task 4\nSort groups of digits in natural number order.",
       [
         <Foo100bar99baz0.txt foo100bar10baz0.txt foo1000bar99baz10.txt
          foo1000bar99baz9.txt 201st 32nd 3rd 144th 17th 2 95>
       ],
       {.&naturally}
   ],
   [
       "Task 5 ( mixed with 1, 2, 3 & 4 )\n"
       ~ "Sort titles, normalize white space, collapse multiple spaces to\n"
       ~ "single, trim leading white space, ignore common leading articles\n"
       ~ 'and sort digit groups in natural order.',
       [
         'The Wind	in the Willows  8', '  The 39 Steps               3',
         'The    7th Seal              1', 'Wanda                        6',
         'A Fish Called Wanda          5', ' The Wind and the Lion       7',
         'Any Which Way But Loose      4', '12 Monkeys                   2'
       ],
       {.&normalize.&collapse.trim.&title.&naturally}
   ],
   [
       "Task 6, 7, 8\nMap letters in Latin1 that have accents or decompose to two\n"
       ~ 'characters to their base characters for sorting.',
       [
         <apple Ball bald car Card above Æon æon aether
           niño nina e-mail Évian evoke außen autumn>
       ],
       {.&latin1_decompose.&naturally}
   ]

);


for @tests -> $case {

   my $code_ref = $case.pop;
   my @array = $case.pop.list;
   say $case.pop, "\n";
   say "Standard Sort:\n";
   .say for @array.sort;
   say "\nNatural Sort:\n";
   .say for @array.sort: {.$code_ref};
   say "\n" ~ '*' x 40 ~ "\n";

}</lang>

Sample output:

Task 1a
Sort while ignoring leading spaces.

Standard Sort:

   ignore leading spaces: 4
  ignore leading spaces: 3
 ignore leading spaces: 2
ignore leading spaces: 1

Natural Sort:

ignore leading spaces: 1
 ignore leading spaces: 2
  ignore leading spaces: 3
   ignore leading spaces: 4

****************************************

Task 1b
Sort while ignoring multiple adjacent spaces.

Standard Sort:

ignore m.a.s    spaces: 4
ignore m.a.s   spaces: 3
ignore m.a.s  spaces: 2
ignore m.a.s spaces: 1

Natural Sort:

ignore m.a.s spaces: 1
ignore m.a.s  spaces: 2
ignore m.a.s   spaces: 3
ignore m.a.s    spaces: 4

****************************************

Task 2
Sort with all white space normalized to regular spaces.

Standard Sort:

Normalized	spaces: 4
Normalized
spaces: 3
Normalized spaces: 2
Normalized spaces: 1

Natural Sort:

Normalized spaces: 1
Normalized spaces: 2
Normalized
spaces: 3
Normalized	spaces: 4

****************************************

Task 3
Sort case independently.

Standard Sort:

cASE INDEPENDENT: 4
caSE INDEPENDENT: 3
casE INDEPENDENT: 2
case INDEPENDENT: 1

Natural Sort:

case INDEPENDENT: 1
casE INDEPENDENT: 2
caSE INDEPENDENT: 3
cASE INDEPENDENT: 4

****************************************

Task 4
Sort groups of digits in natural number order.

Standard Sort:

144th
17th
2
201st
32nd
3rd
95
Foo100bar99baz0.txt
foo1000bar99baz10.txt
foo1000bar99baz9.txt
foo100bar10baz0.txt

Natural Sort:

2
3rd
17th
32nd
95
144th
201st
foo100bar10baz0.txt
Foo100bar99baz0.txt
foo1000bar99baz9.txt
foo1000bar99baz10.txt

****************************************

Task 5 ( mixed with 1, 2, 3 & 4 )
Sort titles, normalize white space, collapse multiple spaces to
single, trim leading white space, ignore common leading articles
and sort digit groups in natural order.

Standard Sort:

  The 39 Steps               3
 The Wind and the Lion       7
12 Monkeys                   2
A Fish Called Wanda          5
Any Which Way But Loose      4
The    7th Seal              1
The Wind	in the Willows  8
Wanda                        6

Natural Sort:

The    7th Seal              1
12 Monkeys                   2
  The 39 Steps               3
Any Which Way But Loose      4
A Fish Called Wanda          5
Wanda                        6
 The Wind and the Lion       7
The Wind	in the Willows  8

****************************************

Task 6, 7, 8
Map letters in Latin1 that have accents or decompose to two
characters to their base characters for sorting.

Standard Sort:

Ball
Card
above
aether
apple
autumn
außen
bald
car
e-mail
evoke
nina
niño
Æon
Évian
æon

Natural Sort:

above
Æon
æon
aether
apple
außen
autumn
bald
Ball
car
Card
e-mail
Évian
evoke
nina
niño

****************************************

PicoLisp

This parser takes care of features 1,2,3,4,5 and 8: <lang PicoLisp>(de parseNatural (Str)

  (clip
     (make
        (for (L (chop Str)  L)
           (cond
              ((sp? (car L))
                 (link " ")
                 (while (and L (sp? (car L)))
                    (pop 'L) ) )
              ((>= "9" (car L) "0")
                 (link
                    (format
                       (make
                          (loop
                             (link (pop 'L))
                             (NIL (>= "9" (car L) "0")) ) ) ) ) )
              (T
                 (let Word
                    (pack
                       (replace
                          (make
                             (loop
                                (link (lowc (pop 'L)))
                                (NIL L)
                                (T (sp? (car L)))
                                (T (>= "9" (car L) "0")) ) )
                           "ß" "ss" "ſ" "s" "ʒ" "s" ) )
                    (unless (member Word '(the it to))
                       (link Word) ) ) ) ) ) ) ) )</lang>

Test: <lang PicoLisp>: (parseNatural " ^MThe abc123Defß ^I Ghi ") -> ("abc" 123 "defss" " " "ghi")</lang> Sorting is trivial then: <lang PicoLisp>(de naturalSort (Lst)

  (by parseNatural sort Lst) )</lang>

Test: <lang PicoLisp>(de *TestData

  "# Ignoring leading spaces"
  ("ignore leading spaces: 2-2" " ignore leading spaces: 2-1" "  ignore leading spaces: 2+0" "   ignore leading spaces: 2+1")
  "# Ignoring multiple adjacent spaces (m.a.s)"
  ("ignore m.a.s spaces: 2-2" "ignore m.a.s  spaces: 2-1" "ignore m.a.s   spaces: 2+0" "ignore m.a.s    spaces: 2+1")
  "# Equivalent whitespace characters"
  ("Equiv. spaces: 3-3" "Equiv.^Mspaces: 3-2" "Equiv.^Acspaces: 3-1" "Equiv.^Kbspaces: 3+0" "Equiv.^Jspaces: 3+1" "Equiv.^Ispaces: 3+2")
  "# Case Indepenent sort"
  ("cASE INDEPENENT: 3-2" "caSE INDEPENENT: 3-1" "casE INDEPENENT: 3+0" "case INDEPENENT: 3+1")
  "# Numeric fields as numerics"
  ("foo100bar99baz0.txt" "foo100bar10baz0.txt" "foo1000bar99baz10.txt" "foo1000bar99baz9.txt")
  "# Title sorts"
  ("The Wind in the Willows" "The 40th step more" "The 39 steps" "Wanda")
  "# Equivalent accented characters (and case)"
  ("Equiv. ý accents: 2-2" "Equiv. Ý accents: 2-1" "Equiv. y accents: 2+0" "Equiv. Y accents: 2+1")
  # "Separated ligatures"
  ### ("IJ ligatured ij" "no ligature")
  "# Character replacements"
  ("Start with an ʒ: 2-2" "Start with an ſ: 2-1" "Start with an ß: 2+0" "Start with an s: 2+1") )

(de pythonOut (Ttl Lst)

  (prinl Ttl)
  (prin "['" (car Lst))
  (for S (cdr Lst)
     (prin "',^J '" S) )
  (prinl "']") )

(for X *TestData

  (if (atom X)
     (prinl X)
     (pythonOut "Text strings:" X)
     (pythonOut "Normally sorted :" (sort (copy X)))
     (pythonOut "Naturally sorted:" (naturalSort X))
     (prinl) ) )</lang>

Output:

# Ignoring leading spaces
Text strings:
['ignore leading spaces: 2-2',
 ' ignore leading spaces: 2-1',
 '  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1']
Normally sorted :
['   ignore leading spaces: 2+1',
 '  ignore leading spaces: 2+0',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']
Naturally sorted:
['  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']

# Ignoring multiple adjacent spaces (m.a.s)
Text strings:
['ignore m.a.s spaces: 2-2',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1']
Normally sorted :
['ignore m.a.s    spaces: 2+1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']
Naturally sorted:
['ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']

# Equivalent whitespace characters
Text strings:
['Equiv. spaces: 3-3',
 'Equiv.
spaces: 3-2',
 'Equiv.�cspaces: 3-1',
 'Equiv.�bspaces: 3+0',
 'Equiv.
spaces: 3+1',
 'Equiv.	spaces: 3+2']
Normally sorted :
['Equiv.�cspaces: 3-1',
 'Equiv.	spaces: 3+2',
 'Equiv.
spaces: 3+1',
 'Equiv.�bspaces: 3+0',
 'Equiv.
spaces: 3-2',
 'Equiv. spaces: 3-3']
Naturally sorted:
['Equiv.�bspaces: 3+0',
 'Equiv.�cspaces: 3-1',
 'Equiv.
spaces: 3+1',
 'Equiv.	spaces: 3+2',
 'Equiv.
spaces: 3-2',
 'Equiv. spaces: 3-3']

# Case Indepenent sort
Text strings:
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Normally sorted :
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Naturally sorted:
['casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1',
 'caSE INDEPENENT: 3-1',
 'cASE INDEPENENT: 3-2']

# Numeric fields as numerics
Text strings:
['foo100bar99baz0.txt',
 'foo100bar10baz0.txt',
 'foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt']
Normally sorted :
['foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt',
 'foo100bar10baz0.txt',
 'foo100bar99baz0.txt']
Naturally sorted:
['foo100bar10baz0.txt',
 'foo100bar99baz0.txt',
 'foo1000bar99baz9.txt',
 'foo1000bar99baz10.txt']

# Title sorts
Text strings:
['The Wind in the Willows',
 'The 40th step more',
 'The 39 steps',
 'Wanda']
Normally sorted :
['The 39 steps',
 'The 40th step more',
 'The Wind in the Willows',
 'Wanda']
Naturally sorted:
['The 39 steps',
 'The 40th step more',
 'Wanda',
 'The Wind in the Willows']

# Equivalent accented characters (and case)
Text strings:
['Equiv. ý accents: 2-2',
 'Equiv. Ý accents: 2-1',
 'Equiv. y accents: 2+0',
 'Equiv. Y accents: 2+1']
Normally sorted :
['Equiv. Y accents: 2+1',
 'Equiv. y accents: 2+0',
 'Equiv. Ý accents: 2-1',
 'Equiv. ý accents: 2-2']
Naturally sorted:
['Equiv. y accents: 2+0',
 'Equiv. Y accents: 2+1',
 'Equiv. Ý accents: 2-1',
 'Equiv. ý accents: 2-2']

# Character replacements
Text strings:
['Start with an ʒ: 2-2',
 'Start with an ſ: 2-1',
 'Start with an ß: 2+0',
 'Start with an s: 2+1']
Normally sorted :
['Start with an s: 2+1',
 'Start with an ß: 2+0',
 'Start with an ſ: 2-1',
 'Start with an ʒ: 2-2']
Naturally sorted:
['Start with an s: 2+1',
 'Start with an ſ: 2-1',
 'Start with an ʒ: 2-2',
 'Start with an ß: 2+0']

Python

All eight features: <lang python># -*- coding: utf-8 -*-

  1. Not Python 3.x (Can't compare str and int)


from itertools import groupby from unicodedata import decomposition, name from pprint import pprint as pp

commonleaders = ['the'] # lowercase leading words to ignore replacements = {u'ß': 'ss', # Map single char to replacement string

               u'ſ': 's',
               u'ʒ': 's',
               }

hexdigits = set('0123456789abcdef') decdigits = set('0123456789') # Don't use str.isnumeric

def splitchar(c):

   ' De-ligature. De-accent a char'
   de = decomposition(c)
   if de:
       # Just the words that are also hex numbers
       de = [d for d in de.split()
                 if all(c.lower()
                        in hexdigits for c in d)]
       n = name(c, c).upper()
       # (Gosh it's onerous)
       if len(de)> 1 and 'PRECEDE' in n:
           # E.g. ʼn  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
           de[1], de[0] = de[0], de[1]
       tmp = [ unichr(int(k, 16)) for k in de]
       base, others = tmp[0], tmp[1:]
       if 'LIGATURE' in n:
           # Assume two character ligature
           base += others.pop(0)
   else:
       base = c
   return base
   

def sortkeygen(s):

   Generate 'natural' sort key for s
   Doctests:
       >>> sortkeygen('  some extra    spaces  ')
       [u'some extra spaces']
       >>> sortkeygen('CasE InseNsItIve')
       [u'case insensitive']
       >>> sortkeygen('The Wind in the Willows')
       [u'wind in the willows']
       >>> sortkeygen(u'\462 ligature')
       [u'ij ligature']
       >>> sortkeygen(u'\335\375 upper/lower case Y with acute accent')
       [u'yy upper/lower case y with acute accent']
       >>> sortkeygen('foo9.txt')
       [u'foo', 9, u'.txt']
       >>> sortkeygen('x9y99')
       [u'x', 9, u'y', 99]
   
   # Ignore leading and trailing spaces
   s = unicode(s).strip()
   # All space types are equivalent
   s = ' '.join(s.split())
   # case insentsitive
   s = s.lower()
   # Title
   words = s.split()
   if len(words) > 1 and words[0] in commonleaders:
       s = ' '.join( words[1:])
   # accent and ligatures
   s = .join(splitchar(c) for c in s)
   # Replacements (single char replaced by one or more)
   s = .join( replacements.get(ch, ch) for ch in s )
   # Numeric sections as numerics
   s = [ int("".join(g)) if isinteger else "".join(g)
         for isinteger,g in groupby(s, lambda x: x in decdigits)]
   return s

def naturalsort(items):

    Naturally sort a series of strings
   Doctests:
       >>> naturalsort(['The Wind in the Willows','The 40th step more',
                        'The 39 steps', 'Wanda'])
       ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
   
   return sorted(items, key=sortkeygen)

if __name__ == '__main__':

   import string
   
   ns = naturalsort
   print '\n# Ignoring leading spaces'
   txt = ['%signore leading spaces: 2%+i' % (' '*i, i-2) for i in range(4)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Ignoring multiple adjacent spaces (m.a.s)'
   txt = ['ignore m.a.s%s spaces: 2%+i' % (' '*i, i-2) for i in range(4)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Equivalent whitespace characters'
   txt = ['Equiv.%sspaces: 3%+i' % (ch, i-3)
          for i,ch in enumerate(reversed(string.whitespace))]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Case Indepenent sort'
   s = 'CASE INDEPENENT'
   txt = [s[:i].lower() + s[i:] + ': 3%+i' % (i-3) for i in range(1,5)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Numeric fields as numerics'
   txt = ['foo100bar99baz0.txt', 'foo100bar10baz0.txt',
          'foo1000bar99baz10.txt', 'foo1000bar99baz9.txt']
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Title sorts'
   txt = ['The Wind in the Willows','The 40th step more',
                        'The 39 steps', 'Wanda']
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Equivalent accented characters (and case)'
   txt = ['Equiv. %s accents: 2%+i' % (ch, i-2)
          for i,ch in enumerate(u'\xfd\xddyY')]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Separated ligatures'
   txt = [u'\462 ligatured ij', 'no ligature',]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   
   print '\n# Character replacements'
   s = u'ʒſßs' # u'\u0292\u017f\xdfs'
   txt = ['Start with an %s: 2%+i' % (ch, i-2)
          for i,ch in enumerate(s)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; print '\n'.join(sorted(txt))
   print 'Naturally sorted:'; print '\n'.join(ns(txt))</lang>

Sample Python output

# Ignoring leading spaces
Text strings:
['ignore leading spaces: 2-2',
 ' ignore leading spaces: 2-1',
 '  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1']
Normally sorted :
['   ignore leading spaces: 2+1',
 '  ignore leading spaces: 2+0',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']
Naturally sorted:
['  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']

# Ignoring multiple adjacent spaces (m.a.s)
Text strings:
['ignore m.a.s spaces: 2-2',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1']
Normally sorted :
['ignore m.a.s    spaces: 2+1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']
Naturally sorted:
['ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']

# Equivalent whitespace characters
Text strings:
['Equiv. spaces: 3-3',
 'Equiv.\rspaces: 3-2',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\x0bspaces: 3+0',
 'Equiv.\nspaces: 3+1',
 'Equiv.\tspaces: 3+2']
Normally sorted :
['Equiv.\tspaces: 3+2',
 'Equiv.\nspaces: 3+1',
 'Equiv.\x0bspaces: 3+0',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\rspaces: 3-2',
 'Equiv. spaces: 3-3']
Naturally sorted:
['Equiv.\x0bspaces: 3+0',
 'Equiv.\nspaces: 3+1',
 'Equiv.\tspaces: 3+2',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\rspaces: 3-2',
 'Equiv. spaces: 3-3']

# Case Indepenent sort
Text strings:
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Normally sorted :
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Naturally sorted:
['casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1',
 'caSE INDEPENENT: 3-1',
 'cASE INDEPENENT: 3-2']

# Numeric fields as numerics
Text strings:
['foo100bar99baz0.txt',
 'foo100bar10baz0.txt',
 'foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt']
Normally sorted :
['foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt',
 'foo100bar10baz0.txt',
 'foo100bar99baz0.txt']
Naturally sorted:
['foo100bar10baz0.txt',
 'foo100bar99baz0.txt',
 'foo1000bar99baz9.txt',
 'foo1000bar99baz10.txt']

# Title sorts
Text strings:
['The Wind in the Willows', 'The 40th step more', 'The 39 steps', 'Wanda']
Normally sorted :
['The 39 steps', 'The 40th step more', 'The Wind in the Willows', 'Wanda']
Naturally sorted:
['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']

# Equivalent accented characters (and case)
Text strings:
[u'Equiv. \xfd accents: 2-2',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. y accents: 2+0',
 u'Equiv. Y accents: 2+1']
Normally sorted :
[u'Equiv. Y accents: 2+1',
 u'Equiv. y accents: 2+0',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. \xfd accents: 2-2']
Naturally sorted:
[u'Equiv. y accents: 2+0',
 u'Equiv. Y accents: 2+1',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. \xfd accents: 2-2']

# Separated ligatures
Text strings:
[u'\u0132 ligatured ij', 'no ligature']
Normally sorted :
['no ligature', u'\u0132 ligatured ij']
Naturally sorted:
[u'\u0132 ligatured ij', 'no ligature']

# Character replacements
Text strings:
[u'Start with an \u0292: 2-2',
 u'Start with an \u017f: 2-1',
 u'Start with an \xdf: 2+0',
 u'Start with an s: 2+1']
Normally sorted :
Start with an s: 2+1
Start with an ß: 2+0
Start with an ſ: 2-1
Start with an ʒ: 2-2
Naturally sorted:
Start with an s: 2+1
Start with an ſ: 2-1
Start with an ʒ: 2-2
Start with an ß: 2+0

Racket

Implements 1-4 (but only normalize spaces -- don't ignore spaces at the beginning/end, easy to implement, but sounds wrong).

<lang racket>

  1. lang racket

(define (natural-sort l)

 (define (list<? l1 l2)
   (cond [(null? l2) #f]
         [(null? l1) #t]
         [(number? (car l1)) (cond [(< (car l1) (car l2)) #t]
                                   [(< (car l2) (car l1)) #f]
                                   [else (list<? (cdr l1) (cdr l2))])]
         [(string? (car l1)) (cond [(string<? (car l1) (car l2)) #t]
                                   [(string<? (car l2) (car l1)) #f]
                                   [else (list<? (cdr l1) (cdr l2))])]))
 (define (->keys s)
   (define s* (string-normalize-spaces (string-foldcase s)))
   (for/list ([x (regexp-match* #px"\\d+" s* #:gap-select? #t)]
              [i (in-naturals)])
     (if (odd? i) (string->number x) x)))
 (sort l list<? #:key ->keys #:cache-keys? #t))

(natural-sort

(shuffle '("foo9.txt" "foo10.txt" "x9y99" "x9y100" "x10y0" "x  z" "x y")))
=> '("foo9.txt" "foo10.txt" "x9y99" "x9y100" "x10y0" "x y" "x z")

</lang>

Tcl

Tcl supports two methods of doing sorting by non-natural keys in the lsort command: the -command option allows the specification of code that makes the ordering decision for a pair of values, but instead the code below demonstrates sorting through the use of collation keys, strings that when sorted in their normal order result in the natural order being used. (These are handled through the use of the -indices option which makes it easy to generate a sorted original list without any need to build compound intermediate tuples.)

Note also that Tcl supports case-insensitive sorting and “treat digit sequences as numbers” as native sorting options. (The latter is particularly useful for handling filenames.) <lang tcl>package require Tcl 8.5

proc sortWithCollationKey {keyBuilder list} {

   if {![llength $list]} return
   foreach value $list {

lappend toSort [{*}$keyBuilder $value]

   }
   foreach idx [lsort -indices $toSort] {

lappend result [lindex $list $idx]

   }
   return $result

} proc normalizeSpaces {str} {

   regsub -all {[ ]+} [string trim $str " "] " "

} proc equivalentWhitespace {str} {

   regsub -all {\s} $str " "

}

proc show {description sorter strings} {

   puts "Input:\n\t[join $strings \n\t]"
   set sorted [lsort $strings]
   puts "Normally sorted:\n\t[join $sorted \n\t]"
   set sorted [{*}$sorter $strings]
   puts "Naturally sorted with ${description}:\n\t[join $sorted \n\t]"

}

  1. Two demonstrations of the space normalizer

show "normalized spaces" {sortWithCollationKey normalizeSpaces} {

   {ignore leading spaces: 2-2}
   { ignore leading spaces: 2-1}
   {  ignore leading spaces: 2+0}
   {   ignore leading spaces: 2+1}}

show "normalized spaces" {sortWithCollationKey normalizeSpaces} {

   {ignore m.a.s spaces: 2-2}
   {ignore m.a.s  spaces: 2-1}
   {ignore m.a.s   spaces: 2+0}
   {ignore m.a.s    spaces: 2+1}}
  1. Use a collation key that maps all whitespace to spaces

show "all whitespace equivalent" {sortWithCollationKey equivalentWhitespace} {

   "Equiv. spaces: 3-3"
   "Equiv.\rspaces: 3-2"
   "Equiv.\u000cspaces: 3-1"
   "Equiv.\u000bspaces: 3+0"
   "Equiv.\nspaces: 3+1"
   "Equiv.\tspaces: 3+2"}
  1. These are built-in modes

show "(built-in) case insensitivity" {lsort -nocase} {

   {cASE INDEPENENT: 3-2}
   {caSE INDEPENENT: 3-1}
   {casE INDEPENENT: 3+0}
   {case INDEPENENT: 3+1}}

show "digit sequences as numbers" {lsort -dictionary} {

   foo100bar99baz0.txt
   foo100bar10baz0.txt
   foo1000bar99baz10.txt
   foo1000bar99baz9.txt}</lang>

Output:

Input:
	ignore leading spaces: 2-2
	 ignore leading spaces: 2-1
	  ignore leading spaces: 2+0
	   ignore leading spaces: 2+1
Normally sorted:
	   ignore leading spaces: 2+1
	  ignore leading spaces: 2+0
	 ignore leading spaces: 2-1
	ignore leading spaces: 2-2
Naturally sorted with normalized spaces:
	  ignore leading spaces: 2+0
	   ignore leading spaces: 2+1
	 ignore leading spaces: 2-1
	ignore leading spaces: 2-2
Input:
	ignore m.a.s spaces: 2-2
	ignore m.a.s  spaces: 2-1
	ignore m.a.s   spaces: 2+0
	ignore m.a.s    spaces: 2+1
Normally sorted:
	ignore m.a.s    spaces: 2+1
	ignore m.a.s   spaces: 2+0
	ignore m.a.s  spaces: 2-1
	ignore m.a.s spaces: 2-2
Naturally sorted with normalized spaces:
	ignore m.a.s   spaces: 2+0
	ignore m.a.s    spaces: 2+1
	ignore m.a.s  spaces: 2-1
	ignore m.a.s spaces: 2-2
Input:
	Equiv. spaces: 3-3
spaces: 3-2iv.
	Equiv.
              spaces: 3-1
	Equiv.
              spaces: 3+0
	Equiv.
spaces: 3+1
	Equiv.	spaces: 3+2
Normally sorted:
	Equiv.	spaces: 3+2
	Equiv.
spaces: 3+1
	Equiv.
              spaces: 3+0
	Equiv.
              spaces: 3-1
spaces: 3-2iv.
	Equiv. spaces: 3-3
Naturally sorted with all whitespace equivalent:
	Equiv.
              spaces: 3+0
	Equiv.
spaces: 3+1
	Equiv.	spaces: 3+2
	Equiv.
              spaces: 3-1
spaces: 3-2iv.
	Equiv. spaces: 3-3
Input:
	cASE INDEPENENT: 3-2
	caSE INDEPENENT: 3-1
	casE INDEPENENT: 3+0
	case INDEPENENT: 3+1
Normally sorted:
	cASE INDEPENENT: 3-2
	caSE INDEPENENT: 3-1
	casE INDEPENENT: 3+0
	case INDEPENENT: 3+1
Naturally sorted with (built-in) case insensitivity:
	casE INDEPENENT: 3+0
	case INDEPENENT: 3+1
	caSE INDEPENENT: 3-1
	cASE INDEPENENT: 3-2
Input:
	foo100bar99baz0.txt
	foo100bar10baz0.txt
	foo1000bar99baz10.txt
	foo1000bar99baz9.txt
Normally sorted:
	foo1000bar99baz10.txt
	foo1000bar99baz9.txt
	foo100bar10baz0.txt
	foo100bar99baz0.txt
Naturally sorted with digit sequences as numbers:
	foo100bar10baz0.txt
	foo100bar99baz0.txt
	foo1000bar99baz9.txt
	foo1000bar99baz10.txt