Natural sorting
Natural sorting is the sorting of text that does more than rely on the order of individual characters codes to make the finding of individual strings easier for a human reader.
Some 'natural' orderings might include:
- 1. Ignore leading, trailing and multiple adjacent spaces
- 2. All space types equivalent.
- 3. Sorting without regard to case.
- 4. Sorting numeric portions of strings in numeric order. That is split the string into fields on numeric boundaries, then sort on each field, with the rightmost fields being the most significant, and numeric fields of integers treated as numbers.
- foo9.txt before foo10.txt
- As well as ... x9y99 before x9y100, before x10y0
- ... (for any number of groups of integers in a string).
- 5. Title sorts: without regard to a leading, very common, word such
- as 'The' in "The thirty-nine steps".
- 6. Sort letters without regard to accents.
- 7. Sort ligatures as separate letters.
- 8. Replacements:
- Sort german scharfes S (ß) as ss
- Sort ſ, LATIN SMALL LETTER LONG S as s
- Sort ʒ, LATIN SMALL LETTER EZH as s
- ...
- Task Description
- Implement at least four of the eight given features in a natural sorting routine/function/method...
- Test each feature implemented separately with an ordered set of test strings from the appropriate section here and make sure your naturally sorted output is in the same order.
- Print and display your output
Note: It is not necessary to have individual control of which features are active in the natural sorting routine at any time.
Python
All eight features: <lang python># -*- coding: utf-8 -*-
- Not Python 3.x (Can't compare str and int)
from itertools import groupby
from unicodedata import decomposition, name
from pprint import pprint as pp
commonleaders = ['the'] # lowercase leading words to ignore replacements = {u'ß': 'ss', # Map single char to replacement string
u'ſ': 's', u'ʒ': 's', }
hexdigits = set('0123456789abcdef') decdigits = set('0123456789') # Don't use str.isnumeric
def splitchar(c):
' De-ligature. De-accent a char' de = decomposition(c) if de: # Just the words that are also hex numbers de = [d for d in de.split() if all(c.lower() in hexdigits for c in d)] n = name(c, c).upper() # (Gosh it's onerous) if len(de)> 1 and 'PRECEDE' in n: # E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE de[1], de[0] = de[0], de[1] tmp = [ unichr(int(k, 16)) for k in de] base, others = tmp[0], tmp[1:] if 'LIGATURE' in n: # Assume two character ligature base += others.pop(0) else: base = c return base
def sortkeygen(s):
Generate 'natural' sort key for s
Doctests: >>> sortkeygen(' some extra spaces ') [u'some extra spaces'] >>> sortkeygen('CasE InseNsItIve') [u'case insensitive'] >>> sortkeygen('The Wind in the Willows') [u'wind in the willows'] >>> sortkeygen(u'\462 ligature') [u'ij ligature'] >>> sortkeygen(u'\335\375 upper/lower case Y with acute accent') [u'yy upper/lower case y with acute accent'] >>> sortkeygen('foo9.txt') [u'foo', 9, u'.txt'] >>> sortkeygen('x9y99') [u'x', 9, u'y', 99] # Ignore leading and trailing spaces s = unicode(s).strip() # All space types are equivalent s = ' '.join(s.split()) # case insentsitive s = s.lower() # Title words = s.split() if len(words) > 1 and words[0] in commonleaders: s = ' '.join( words[1:]) # accent and ligatures s = .join(splitchar(c) for c in s) # Replacements (single char replaced by one or more) s = .join( replacements.get(ch, ch) for ch in s ) # Numeric sections as numerics s = [ int("".join(g)) if isinteger else "".join(g) for isinteger,g in groupby(s, lambda x: x in decdigits)]
return s
def naturalsort(items):
Naturally sort a series of strings
Doctests: >>> naturalsort(['The Wind in the Willows','The 40th step more', 'The 39 steps', 'Wanda']) ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
return sorted(items, key=sortkeygen)
if __name__ == '__main__':
import string ns = naturalsort
print '\n# Ignoring leading spaces' txt = ['%signore leading spaces: 2%+i' % (' '*i, i-2) for i in range(4)] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Ignoring multiple adjacent spaces (m.a.s)' txt = ['ignore m.a.s%s spaces: 2%+i' % (' '*i, i-2) for i in range(4)] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Equivalent whitespace characters' txt = ['Equiv.%sspaces: 3%+i' % (ch, i-3) for i,ch in enumerate(reversed(string.whitespace))] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Case Indepenent sort' s = 'CASE INDEPENENT' txt = [s[:i].lower() + s[i:] + ': 3%+i' % (i-3) for i in range(1,5)] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Numeric fields as numerics' txt = ['foo100bar99baz0.txt', 'foo100bar10baz0.txt', 'foo1000bar99baz10.txt', 'foo1000bar99baz9.txt'] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Title sorts' txt = ['The Wind in the Willows','The 40th step more', 'The 39 steps', 'Wanda'] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Equivalent accented characters (and case)' txt = ['Equiv. %s accents: 2%+i' % (ch, i-2) for i,ch in enumerate(u'\xfd\xddyY')] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt))
print '\n# Separated ligatures' txt = [u'\462 ligatured ij', 'no ligature',] print 'Text strings:'; pp(txt) print 'Normally sorted :'; pp(sorted(txt)) print 'Naturally sorted:'; pp(ns(txt)) print '\n# Character replacements' s = u'ʒſßs' # u'\u0292\u017f\xdfs' txt = ['Start with an %s: 2%+i' % (ch, i-2) for i,ch in enumerate(s)] print 'Text strings:'; pp(txt) print 'Normally sorted :'; print '\n'.join(sorted(txt)) print 'Naturally sorted:'; print '\n'.join(ns(txt))</lang>
Sample Python output
# Ignoring leading spaces Text strings: ['ignore leading spaces: 2-2', ' ignore leading spaces: 2-1', ' ignore leading spaces: 2+0', ' ignore leading spaces: 2+1'] Normally sorted : [' ignore leading spaces: 2+1', ' ignore leading spaces: 2+0', ' ignore leading spaces: 2-1', 'ignore leading spaces: 2-2'] Naturally sorted: [' ignore leading spaces: 2+0', ' ignore leading spaces: 2+1', ' ignore leading spaces: 2-1', 'ignore leading spaces: 2-2'] # Ignoring multiple adjacent spaces (m.a.s) Text strings: ['ignore m.a.s spaces: 2-2', 'ignore m.a.s spaces: 2-1', 'ignore m.a.s spaces: 2+0', 'ignore m.a.s spaces: 2+1'] Normally sorted : ['ignore m.a.s spaces: 2+1', 'ignore m.a.s spaces: 2+0', 'ignore m.a.s spaces: 2-1', 'ignore m.a.s spaces: 2-2'] Naturally sorted: ['ignore m.a.s spaces: 2+0', 'ignore m.a.s spaces: 2+1', 'ignore m.a.s spaces: 2-1', 'ignore m.a.s spaces: 2-2'] # Equivalent whitespace characters Text strings: ['Equiv. spaces: 3-3', 'Equiv.\rspaces: 3-2', 'Equiv.\x0cspaces: 3-1', 'Equiv.\x0bspaces: 3+0', 'Equiv.\nspaces: 3+1', 'Equiv.\tspaces: 3+2'] Normally sorted : ['Equiv.\tspaces: 3+2', 'Equiv.\nspaces: 3+1', 'Equiv.\x0bspaces: 3+0', 'Equiv.\x0cspaces: 3-1', 'Equiv.\rspaces: 3-2', 'Equiv. spaces: 3-3'] Naturally sorted: ['Equiv.\x0bspaces: 3+0', 'Equiv.\nspaces: 3+1', 'Equiv.\tspaces: 3+2', 'Equiv.\x0cspaces: 3-1', 'Equiv.\rspaces: 3-2', 'Equiv. spaces: 3-3'] # Case Indepenent sort Text strings: ['cASE INDEPENENT: 3-2', 'caSE INDEPENENT: 3-1', 'casE INDEPENENT: 3+0', 'case INDEPENENT: 3+1'] Normally sorted : ['cASE INDEPENENT: 3-2', 'caSE INDEPENENT: 3-1', 'casE INDEPENENT: 3+0', 'case INDEPENENT: 3+1'] Naturally sorted: ['casE INDEPENENT: 3+0', 'case INDEPENENT: 3+1', 'caSE INDEPENENT: 3-1', 'cASE INDEPENENT: 3-2'] # Numeric fields as numerics Text strings: ['foo100bar99baz0.txt', 'foo100bar10baz0.txt', 'foo1000bar99baz10.txt', 'foo1000bar99baz9.txt'] Normally sorted : ['foo1000bar99baz10.txt', 'foo1000bar99baz9.txt', 'foo100bar10baz0.txt', 'foo100bar99baz0.txt'] Naturally sorted: ['foo100bar10baz0.txt', 'foo100bar99baz0.txt', 'foo1000bar99baz9.txt', 'foo1000bar99baz10.txt'] # Title sorts Text strings: ['The Wind in the Willows', 'The 40th step more', 'The 39 steps', 'Wanda'] Normally sorted : ['The 39 steps', 'The 40th step more', 'The Wind in the Willows', 'Wanda'] Naturally sorted: ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows'] # Equivalent accented characters (and case) Text strings: [u'Equiv. \xfd accents: 2-2', u'Equiv. \xdd accents: 2-1', u'Equiv. y accents: 2+0', u'Equiv. Y accents: 2+1'] Normally sorted : [u'Equiv. Y accents: 2+1', u'Equiv. y accents: 2+0', u'Equiv. \xdd accents: 2-1', u'Equiv. \xfd accents: 2-2'] Naturally sorted: [u'Equiv. y accents: 2+0', u'Equiv. Y accents: 2+1', u'Equiv. \xdd accents: 2-1', u'Equiv. \xfd accents: 2-2'] # Separated ligatures Text strings: [u'\u0132 ligatured ij', 'no ligature'] Normally sorted : ['no ligature', u'\u0132 ligatured ij'] Naturally sorted: [u'\u0132 ligatured ij', 'no ligature'] # Character replacements Text strings: [u'Start with an \u0292: 2-2', u'Start with an \u017f: 2-1', u'Start with an \xdf: 2+0', u'Start with an s: 2+1'] Normally sorted : Start with an s: 2+1 Start with an ß: 2+0 Start with an ſ: 2-1 Start with an ʒ: 2-2 Naturally sorted: Start with an s: 2+1 Start with an ſ: 2-1 Start with an ʒ: 2-2 Start with an ß: 2+0