Natural sorting

From Rosetta Code
Revision as of 11:12, 23 April 2011 by rosettacode>Paddy3118 (New draft task and Python solution.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Natural sorting is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Natural sorting is the sorting of text that does more than rely on the order of individual characters codes to make the finding of individual strings easier for a human reader.

Some 'natural' orderings might include:

1. Ignore leading, trailing and multiple adjacent spaces
2. All space types equivalent.
3. Sorting without regard to case.
4. Sorting numeric portions of strings in numeric order. That is split the string into fields on numeric boundaries, then sort on each field, with the rightmost fields being the most significant, and numeric fields of integers treated as numbers.
foo9.txt before foo10.txt
As well as ... x9y99 before x9y100, before x10y0
... (for any number of groups of integers in a string).
5. Title sorts: without regard to a leading, very common, word such
as 'The' in "The thirty-nine steps".
6. Sort letters without regard to accents.
7. Sort ligatures as separate letters.
8. Replacements:
Sort german scharfes S (ß) as ss
Sort ſ, LATIN SMALL LETTER LONG S as s
Sort ʒ, LATIN SMALL LETTER EZH as s
...
Task Description
  • Implement at least four of the eight given features in a natural sorting routine/function/method...
  • Test each feature implemented separately with an ordered set of test strings from the appropriate section here and make sure your naturally sorted output is in the same order.
  • Print and display your output

Note: It is not necessary to have individual control of which features are active in the natural sorting routine at any time.

Python

All eight features: <lang python># -*- coding: utf-8 -*-

  1. Not Python 3.x (Can't compare str and int)


from itertools import groupby from unicodedata import decomposition, name from pprint import pprint as pp

commonleaders = ['the'] # lowercase leading words to ignore replacements = {u'ß': 'ss', # Map single char to replacement string

               u'ſ': 's',
               u'ʒ': 's',
               }

hexdigits = set('0123456789abcdef') decdigits = set('0123456789') # Don't use str.isnumeric

def splitchar(c):

   ' De-ligature. De-accent a char'
   de = decomposition(c)
   if de:
       # Just the words that are also hex numbers
       de = [d for d in de.split()
                 if all(c.lower()
                        in hexdigits for c in d)]
       n = name(c, c).upper()
       # (Gosh it's onerous)
       if len(de)> 1 and 'PRECEDE' in n:
           # E.g. ʼn  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
           de[1], de[0] = de[0], de[1]
       tmp = [ unichr(int(k, 16)) for k in de]
       base, others = tmp[0], tmp[1:]
       if 'LIGATURE' in n:
           # Assume two character ligature
           base += others.pop(0)
   else:
       base = c
   return base
   

def sortkeygen(s):

   Generate 'natural' sort key for s
   Doctests:
       >>> sortkeygen('  some extra    spaces  ')
       [u'some extra spaces']
       >>> sortkeygen('CasE InseNsItIve')
       [u'case insensitive']
       >>> sortkeygen('The Wind in the Willows')
       [u'wind in the willows']
       >>> sortkeygen(u'\462 ligature')
       [u'ij ligature']
       >>> sortkeygen(u'\335\375 upper/lower case Y with acute accent')
       [u'yy upper/lower case y with acute accent']
       >>> sortkeygen('foo9.txt')
       [u'foo', 9, u'.txt']
       >>> sortkeygen('x9y99')
       [u'x', 9, u'y', 99]
   
   # Ignore leading and trailing spaces
   s = unicode(s).strip()
   # All space types are equivalent
   s = ' '.join(s.split())
   # case insentsitive
   s = s.lower()
   # Title
   words = s.split()
   if len(words) > 1 and words[0] in commonleaders:
       s = ' '.join( words[1:])
   # accent and ligatures
   s = .join(splitchar(c) for c in s)
   # Replacements (single char replaced by one or more)
   s = .join( replacements.get(ch, ch) for ch in s )
   # Numeric sections as numerics
   s = [ int("".join(g)) if isinteger else "".join(g)
         for isinteger,g in groupby(s, lambda x: x in decdigits)]
   return s

def naturalsort(items):

    Naturally sort a series of strings
   Doctests:
       >>> naturalsort(['The Wind in the Willows','The 40th step more',
                        'The 39 steps', 'Wanda'])
       ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
   
   return sorted(items, key=sortkeygen)

if __name__ == '__main__':

   import string
   
   ns = naturalsort
   print '\n# Ignoring leading spaces'
   txt = ['%signore leading spaces: 2%+i' % (' '*i, i-2) for i in range(4)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Ignoring multiple adjacent spaces (m.a.s)'
   txt = ['ignore m.a.s%s spaces: 2%+i' % (' '*i, i-2) for i in range(4)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Equivalent whitespace characters'
   txt = ['Equiv.%sspaces: 3%+i' % (ch, i-3)
          for i,ch in enumerate(reversed(string.whitespace))]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Case Indepenent sort'
   s = 'CASE INDEPENENT'
   txt = [s[:i].lower() + s[i:] + ': 3%+i' % (i-3) for i in range(1,5)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Numeric fields as numerics'
   txt = ['foo100bar99baz0.txt', 'foo100bar10baz0.txt',
          'foo1000bar99baz10.txt', 'foo1000bar99baz9.txt']
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Title sorts'
   txt = ['The Wind in the Willows','The 40th step more',
                        'The 39 steps', 'Wanda']
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Equivalent accented characters (and case)'
   txt = ['Equiv. %s accents: 2%+i' % (ch, i-2)
          for i,ch in enumerate(u'\xfd\xddyY')]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   print '\n# Separated ligatures'
   txt = [u'\462 ligatured ij', 'no ligature',]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; pp(sorted(txt))
   print 'Naturally sorted:'; pp(ns(txt))
   
   print '\n# Character replacements'
   s = u'ʒſßs' # u'\u0292\u017f\xdfs'
   txt = ['Start with an %s: 2%+i' % (ch, i-2)
          for i,ch in enumerate(s)]
   print 'Text strings:'; pp(txt)
   print 'Normally sorted :'; print '\n'.join(sorted(txt))
   print 'Naturally sorted:'; print '\n'.join(ns(txt))</lang>

Sample Python output

# Ignoring leading spaces
Text strings:
['ignore leading spaces: 2-2',
 ' ignore leading spaces: 2-1',
 '  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1']
Normally sorted :
['   ignore leading spaces: 2+1',
 '  ignore leading spaces: 2+0',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']
Naturally sorted:
['  ignore leading spaces: 2+0',
 '   ignore leading spaces: 2+1',
 ' ignore leading spaces: 2-1',
 'ignore leading spaces: 2-2']

# Ignoring multiple adjacent spaces (m.a.s)
Text strings:
['ignore m.a.s spaces: 2-2',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1']
Normally sorted :
['ignore m.a.s    spaces: 2+1',
 'ignore m.a.s   spaces: 2+0',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']
Naturally sorted:
['ignore m.a.s   spaces: 2+0',
 'ignore m.a.s    spaces: 2+1',
 'ignore m.a.s  spaces: 2-1',
 'ignore m.a.s spaces: 2-2']

# Equivalent whitespace characters
Text strings:
['Equiv. spaces: 3-3',
 'Equiv.\rspaces: 3-2',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\x0bspaces: 3+0',
 'Equiv.\nspaces: 3+1',
 'Equiv.\tspaces: 3+2']
Normally sorted :
['Equiv.\tspaces: 3+2',
 'Equiv.\nspaces: 3+1',
 'Equiv.\x0bspaces: 3+0',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\rspaces: 3-2',
 'Equiv. spaces: 3-3']
Naturally sorted:
['Equiv.\x0bspaces: 3+0',
 'Equiv.\nspaces: 3+1',
 'Equiv.\tspaces: 3+2',
 'Equiv.\x0cspaces: 3-1',
 'Equiv.\rspaces: 3-2',
 'Equiv. spaces: 3-3']

# Case Indepenent sort
Text strings:
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Normally sorted :
['cASE INDEPENENT: 3-2',
 'caSE INDEPENENT: 3-1',
 'casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1']
Naturally sorted:
['casE INDEPENENT: 3+0',
 'case INDEPENENT: 3+1',
 'caSE INDEPENENT: 3-1',
 'cASE INDEPENENT: 3-2']

# Numeric fields as numerics
Text strings:
['foo100bar99baz0.txt',
 'foo100bar10baz0.txt',
 'foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt']
Normally sorted :
['foo1000bar99baz10.txt',
 'foo1000bar99baz9.txt',
 'foo100bar10baz0.txt',
 'foo100bar99baz0.txt']
Naturally sorted:
['foo100bar10baz0.txt',
 'foo100bar99baz0.txt',
 'foo1000bar99baz9.txt',
 'foo1000bar99baz10.txt']

# Title sorts
Text strings:
['The Wind in the Willows', 'The 40th step more', 'The 39 steps', 'Wanda']
Normally sorted :
['The 39 steps', 'The 40th step more', 'The Wind in the Willows', 'Wanda']
Naturally sorted:
['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']

# Equivalent accented characters (and case)
Text strings:
[u'Equiv. \xfd accents: 2-2',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. y accents: 2+0',
 u'Equiv. Y accents: 2+1']
Normally sorted :
[u'Equiv. Y accents: 2+1',
 u'Equiv. y accents: 2+0',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. \xfd accents: 2-2']
Naturally sorted:
[u'Equiv. y accents: 2+0',
 u'Equiv. Y accents: 2+1',
 u'Equiv. \xdd accents: 2-1',
 u'Equiv. \xfd accents: 2-2']

# Separated ligatures
Text strings:
[u'\u0132 ligatured ij', 'no ligature']
Normally sorted :
['no ligature', u'\u0132 ligatured ij']
Naturally sorted:
[u'\u0132 ligatured ij', 'no ligature']

# Character replacements
Text strings:
[u'Start with an \u0292: 2-2',
 u'Start with an \u017f: 2-1',
 u'Start with an \xdf: 2+0',
 u'Start with an s: 2+1']
Normally sorted :
Start with an s: 2+1
Start with an ß: 2+0
Start with an ſ: 2-1
Start with an ʒ: 2-2
Naturally sorted:
Start with an s: 2+1
Start with an ſ: 2-1
Start with an ʒ: 2-2
Start with an ß: 2+0