Isograms and heterograms

Revision as of 00:55, 29 December 2022 by GordonCharlton (talk | contribs) (Added Quackery.)

For the purposes of this task, an isogram means a string where each character present is used the same number of times and an n-isogram means an isogram where each character present is used exactly n times.

Isograms and heterograms is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
Definitions

A heterogram means a string in which no character occurs more than once. It follows that a heterogram is the same thing as a 1-isogram.


Examples

caucasus is a 2-isogram because the letters c, a, u and s all occur twice.

atmospheric is a heterogram because all its letters are used once only.


Task

Using unixdict.txt and ignoring capitalization:


1) Find and display here all words which are n-isograms where n > 1.

Present the results as a single list but sorted as follows:

a. By decreasing order of n;

b. Then by decreasing order of word length;

c. Then by ascending lexicographic order.

2) Secondly, find and display here all words which are heterograms and have more than 10 characters.

Again present the results as a single list but sorted as per b. and c. above.


Reference


Other tasks related to string operations:
Metrics
Counting
Remove/replace
Anagrams/Derangements/shuffling
Find/Search/Determine
Formatting
Song lyrics/poems/Mad Libs/phrases
Tokenize
Sequences


ALGOL 68

# find some isograms ( words where each letter occurs the same number of     #
# times as the others ) and heterograms ( words where each letter occurs     #
# once ). Note a heterogram is an isogram of order 1                         #
IF  FILE input file;
    STRING file name = "unixdict.txt";
    open( input file, file name, stand in channel ) /= 0
THEN
    # failed to open the file #
    print( ( "Unable to open """ + file name + """", newline ) )
ELSE
    # file opened OK #
    BOOL at eof := FALSE;
    # set the EOF handler for the file #
    on logical file end( input file
                       , ( REF FILE f )BOOL:
                         BEGIN # note that we reached EOF on the latest read #
                               # and return TRUE so processing can continue  #
                               at eof := TRUE
                         END
                       );

    # in-place quick sort an array of strings                                #
    PROC s quicksort = ( REF[]STRING a, INT lb, ub )VOID:
         IF ub > lb
         THEN
            # more than one element, so must sort                            #
            INT  left  := lb;
            INT  right := ub;
            # choosing the middle element of the array as the pivot          #
            STRING pivot := a[ left + ( ( right + 1 ) - left ) OVER 2 ];
            WHILE
                WHILE IF left  <= ub THEN a[ left  ] < pivot ELSE FALSE FI
                DO
                    left  +:= 1
                OD;
                WHILE IF right >= lb THEN a[ right ] > pivot ELSE FALSE FI
                DO
                    right -:= 1
                OD;
                left <= right
            DO
                STRING t   := a[ left  ];
                a[ left  ] := a[ right ];
                a[ right ] := t;
                left      +:= 1;
                right     -:= 1
            OD;
            s quicksort( a, lb,   right );
            s quicksort( a, left, ub    )
         FI # s quicksort # ;

    # returns the length of s                                                #
    OP LENGTH = ( STRING s )INT: 1 + ( UPB s - LWB s );
    # returns n if s is an isogram of order n, 0 if s is not an isogram      #
    OP ORDER  = ( STRING s )INT:
       BEGIN
        # count the number of times each character occurs                    #
        [ 0 : max abs char ]INT count;
        FOR i FROM LWB count TO UPB count DO count[ i ] := 0 OD;
        FOR i FROM LWB s TO UPB s DO
            CHAR c = s[ i ];
            IF c >= "A" AND c <= "Z" THEN
                # uppercase - treat as lower                                 #
                count[ ( ABS c - ABS "A" ) + ABS "a" ] +:= 1
            ELSE
                # lowercase or non-letter                                    #
                count[ ABS c ] +:= 1
            FI
        OD;
        INT order := -1;
        # check the characters all occur the same number of times            #
        FOR i FROM LWB count TO UPB count WHILE order /= 0 DO
            IF count[ i ] /= 0 THEN
                # have a characetr that appeared in s                        #
                IF   order = -1 THEN
                    # first character                                        #
                    order := count[ i ]
                ELIF order /= count[ i ] THEN
                    # character occured a different number of times to       #
                    # the previous one                                       #
                    order := 0
                FI
            FI
        OD;
        IF order < 0 THEN 0 ELSE order FI
       END # ORDER # ;
    [ 1 : 2 000 ]STRING words;
    INT   w count := 0;
    WHILE NOT at eof
    DO
        STRING word;
        get( input file, ( word, newline ) );
        IF NOT at eof THEN
            # have another word                                              #
            INT order = ORDER word;
            IF order > 0 THEN
                INT w length = LENGTH word;
                IF ( order = 1 AND w length > 10 ) OR order > 1 THEN
                    # a long heterogram or an isogram                        #
                    # store the word prefixed by the max abs char complement #
                    # of the order and the length so when sorted, the words  #
                    # are ordered as requierd by the task                    #
                    STRING s word = REPR ( max abs char - order    )
                                  + REPR ( max abs char - w length )
                                  + word;
                    words[ w count +:= 1 ] := s word
                FI
            FI
        FI
    OD;
    close( input file );
    # sort the words                                                         #
    s quicksort( words, 1, w count );
    # display the words                                                      #
    INT prev order  := 0;
    INT prev length := 999 999;
    INT p count     := 0;
    FOR w TO w count DO
        STRING gram   = words[ w ];
        INT    order  = max abs char - ABS gram[ 1 ];
        INT    length = max abs char - ABS gram[ 2 ];
        STRING word   = gram[ 3 : ];
        IF order /= prev order THEN
            IF order = 1 THEN
                print( ( newline, "heterograms longer than 10 characters" ) )
            ELSE
                print( ( newline, "isograms of order ", whole( order, 0 ) )
                     )
            FI;
            prev order  := order;
            prev length := 999 999;
            p count     := 0
        FI;
        IF prev length > length OR p count > 5 THEN
            print( ( newline ) );
            prev length := length;
            p count     := 0
        FI;
        print( ( " " * IF length > 11 THEN 1 ELSE 13 - length FI, word ) );
        p count +:= 1
    OD
FI
Output:

isograms of order 3
          aaa          iii
isograms of order 2
     beriberi     bilabial     caucasus     couscous     teammate
       appall       emmett       hannah       murmur       tartar       testes
         anna         coco         dada         deed         dodo         gogo
         isis         juju         lulu         mimi         noon         otto
         papa         peep         poop         teet         tete         toot
         tutu
           ii
heterograms longer than 10 characters
 ambidextrous bluestocking exclusionary incomputable lexicography loudspeaking
 malnourished
  atmospheric  blameworthy  centrifugal  christendom  consumptive  countervail
  countryside  countrywide  disturbance  documentary  earthmoving  exculpatory
  geophysical  inscrutable  misanthrope  problematic  selfadjoint  stenography
  sulfonamide  switchblade  switchboard  switzerland  thunderclap  valedictory
  voluntarism

Factor

Works with: Factor version 0.99 2022-04-03
USING: assocs combinators.short-circuit.smart grouping io
io.encodings.ascii io.files kernel literals math math.order
math.statistics sequences sets sorting ;

CONSTANT: words $[ "unixdict.txt" ascii file-lines ]

: isogram<=> ( a b -- <=> )
    { [ histogram values first ] [ length ] } compare-with ;

: isogram-sort ( seq -- seq' )
    [ isogram<=> invert-comparison ] sort ;

: isogram? ( seq -- ? )
    histogram values { [ first 1 > ] [ all-eq? ] } && ;

: .words-by ( quot -- )
    words swap filter isogram-sort [ print ] each ; inline

"List of n-isograms where n > 1:" print
[ isogram? ] .words-by nl

"List of heterograms of length > 10:" print
[ { [ length 10 > ] [ all-unique? ] } && ] .words-by
Output:
List of n-isograms where n > 1:
aaa
iii
beriberi
bilabial
caucasus
couscous
teammate
appall
emmett
hannah
murmur
tartar
testes
anna
coco
dada
deed
dodo
gogo
isis
juju
lulu
mimi
noon
otto
papa
peep
poop
teet
tete
toot
tutu
ii

List of heterograms of length > 10:
ambidextrous
bluestocking
exclusionary
incomputable
lexicography
loudspeaking
malnourished
atmospheric
blameworthy
centrifugal
christendom
consumptive
countervail
countryside
countrywide
disturbance
documentary
earthmoving
exculpatory
geophysical
inscrutable
misanthrope
problematic
selfadjoint
stenography
sulfonamide
switchblade
switchboard
switzerland
thunderclap
valedictory
voluntarism

J

For this task, we want to know the value of n for n-isograms. This value would be zero for words which are not n-isograms. We can implement this by counting how many times each character occurs and determining whether that value is unique. (If it's the unique value, n is the number of times the first character occurs):

isogram=: {{ {. (#~ 1= #@~.) #/.~ y }} S:0

Also, it's worth noting that unixdict.txt is already in sorted order, even after coercing its contents to lower case:

   (-: /:~) cutLF tolower fread 'unixdict.txt'
1

With this tool and this knowledge, we are ready to tackle this task (the /: expression sorts, and the #~ expression selects):

   > (/: -@isogram,.-@#@>) (#~ 1<isogram) cutLF tolower fread 'unixdict.txt'
aaa     
iii     
beriberi
bilabial
caucasus
couscous
teammate
appall  
emmett  
hannah  
murmur  
tartar  
testes  
anna    
coco    
dada    
deed    
dodo    
gogo    
isis    
juju    
lulu    
mimi    
noon    
otto    
papa    
peep    
poop    
teet    
tete    
toot    
tutu    
ii      
   > (/: -@#@>) (#~ (10 < #@>) * 1=isogram) cutLF tolower fread 'unixdict.txt'
ambidextrous
bluestocking
exclusionary
incomputable
lexicography
loudspeaking
malnourished
atmospheric 
blameworthy 
centrifugal 
christendom 
consumptive 
countervail 
countryside 
countrywide 
disturbance 
documentary 
earthmoving 
exculpatory 
geophysical 
inscrutable 
misanthrope 
problematic 
selfadjoint 
stenography 
sulfonamide 
switchblade 
switchboard 
switzerland 
thunderclap 
valedictory 
voluntarism

jq

This entry assumes that the external file of words does not contain duplicates.

# bag of words
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# If the input string is an n-isogram then return n, otherwise 0:
def isogram:
  bow(ascii_downcase|explode[]|[.]|implode)
  | .[keys_unsorted[0]] as $n
  | if all(.[]; . == $n) then $n else 0 end ;

# Read the word list (inputs) and record the n-isogram value.
# Output: an array of [word, n] values
def words:
  [inputs
   | select(test("^[A-Za-z]+$"))
   | sub("^ +";"") | sub(" +$";"")
   | [., isogram] ];

# Input: an array of [word, n] values
# Sort by decreasing order of n;
# Then by decreasing order of word length;
# Then by ascending lexicographic order
def isograms:
  map( select( .[1] > 1) )
  | sort_by( .[0])
  | sort_by( - (.[0]|length))
  | sort_by( - .[1]);

# Input: an array of [word, n] values
# Sort as for isograms
def heterograms($minlength):
  map(select (.[1] == 1 and (.[0]|length) >= $minlength))
  | sort_by( .[0])
  | sort_by( - (.[0]|length));

words
| (isograms
   | "List of the \(length) n-isograms for which n > 1:",
     foreach .[] as [$word, $n] ({};
        .header = if $n != .group then "\nisograms of order \($n)" else null end
	| .group = $n;
	(.header | select(.)), $word ) ) ,

  (heterograms(11)
   | "\nList of the \(length) heterograms with length > 10:", .[][0])

Invocation

< unixdict.txt jq -Rrn -f isograms-and-heterograms.jq
Output:
List of the 33 n-isograms for which n > 1:

isograms of order 3
aaa
iii

isograms of order 2
beriberi
bilabial
caucasus
couscous
teammate
appall
emmett
hannah
murmur
tartar
testes
anna
coco
dada
deed
dodo
gogo
isis
juju
lulu
mimi
noon
otto
papa
peep
poop
teet
tete
toot
tutu
ii

List of the 32 heterograms with length > 10:
ambidextrous
bluestocking
exclusionary
incomputable
lexicography
loudspeaking
malnourished
atmospheric
blameworthy
centrifugal
christendom
consumptive
countervail
countryside
countrywide
disturbance
documentary
earthmoving
exculpatory
geophysical
inscrutable
misanthrope
problematic
selfadjoint
stenography
sulfonamide
switchblade
switchboard
switzerland
thunderclap
valedictory
voluntarism


Julia

function isogram(word)
    wchars, uchars = collect(word), unique(collect(word))
    ulen, wlen = length(uchars), length(wchars)
    (wlen == 1 || ulen == wlen) && return 1
    n = count(==(first(uchars)), wchars)
    return all(i -> count(==(uchars[i]), wchars) == n, 2:ulen) ? n : 0
end

words = split(lowercase(read("documents/julia/unixdict.txt", String)), r"\s+")
orderlengthtuples = [(isogram(w), length(w), w) for w in words]

tcomp(x, y) = (x[1] != y[1] ? y[1] < x[1] : x[2] != y[2] ? y[2] < x[2] : x[3] < y[3])

nisograms = sort!(filter(t -> t[1] > 1, orderlengthtuples), lt = tcomp)
heterograms = sort!(filter(t -> t[1] == 1 && length(t[3]) > 10, orderlengthtuples), lt = tcomp)

println("N-Isogram   N  Length\n", "-"^24)
foreach(t -> println(rpad(t[3], 8), lpad(t[1], 5), lpad(t[2], 5)), nisograms)
println("\nHeterogram   Length\n", "-"^20)
foreach(t -> println(rpad(t[3], 12), lpad(t[2], 5)), heterograms)
Output:
N-Isogram   N  Length
------------------------
aaa         3    3
iii         3    3
beriberi    2    8
bilabial    2    8
caucasus    2    8
couscous    2    8
teammate    2    8
appall      2    6
emmett      2    6
hannah      2    6
murmur      2    6
tartar      2    6
testes      2    6
anna        2    4
coco        2    4
dada        2    4
deed        2    4
dodo        2    4
gogo        2    4
isis        2    4
juju        2    4
lulu        2    4
mimi        2    4
noon        2    4
otto        2    4
papa        2    4
peep        2    4
poop        2    4
teet        2    4
tete        2    4
toot        2    4
tutu        2    4
ii          2    2

Heterogram   Length
--------------------
ambidextrous   12
bluestocking   12
exclusionary   12
incomputable   12
lexicography   12
loudspeaking   12
malnourished   12
atmospheric    11
blameworthy    11
centrifugal    11
christendom    11
consumptive    11
countervail    11
countryside    11
countrywide    11
disturbance    11
documentary    11
earthmoving    11
exculpatory    11
geophysical    11
inscrutable    11
misanthrope    11
problematic    11
selfadjoint    11
stenography    11
sulfonamide    11
switchblade    11
switchboard    11
switzerland    11
thunderclap    11
valedictory    11
voluntarism    11

Perl

use strict;
use warnings;
use feature 'say';
use Path::Tiny;
use List::Util 'uniq';

my @words = map { lc } path('unixdict.txt')->slurp =~ /^[A-z]{2,}$/gm;

my(@heterogram, %isogram);
for my $w (@words) {
    my %l;
    $l{$_}++ for split '', $w;
    next unless 1 == scalar (my @x = uniq values %l);
    if ($x[0] == 1) { push @heterogram,        $w if length $w > 10 }
    else            { push @{$isogram{$x[0]}}, $w                   }
}

for my $n (reverse sort keys %isogram) {
    my @i = sort { length $b <=> length $a } @{$isogram{$n}};
    say scalar @i . " $n-isograms:\n" . join("\n", @i) . "\n";
}

say scalar(@heterogram) . " heterograms with more than 10 characters:\n" . join "\n", sort { length $b <=> length $a } @heterogram;
Output:
2 3-isograms:
aaa
iii

31 2-isograms:
beriberi
bilabial
caucasus
couscous
teammate
appall
emmett
hannah
murmur
tartar
testes
anna
coco
dada
deed
dodo
gogo
isis
juju
lulu
mimi
noon
otto
papa
peep
poop
teet
tete
toot
tutu
ii

32 heterograms with more than 10 characters:
ambidextrous
bluestocking
exclusionary
incomputable
lexicography
loudspeaking
malnourished
atmospheric
blameworthy
centrifugal
christendom
consumptive
countervail
countryside
countrywide
disturbance
documentary
earthmoving
exculpatory
geophysical
inscrutable
misanthrope
problematic
selfadjoint
stenography
sulfonamide
switchblade
switchboard
switzerland
thunderclap
valedictory
voluntarism

Phix

with javascript_semantics
function isogram(string word)
    sequence chars = {}, counts = {}
    for ch in word do
        integer k = find(ch,chars)
        if k=0 then
            chars &= ch
            counts &= 1
        else
            counts[k] += 1
        end if
    end for
    integer c1 = counts[1], lc = length(counts), lw = length(word)
    return iff((c1>1 or lw>10) and counts=repeat(c1,lc)?{word,c1,lw}:0)
end function
 
sequence res = sort_columns(filter(apply(unix_dict(),isogram),"!=",0),{-2,-3,1})
printf(1,"word           n length\n%s\n",{join(res,'\n',fmt:="%-14s %d %6d")})
Output:
word           n length
aaa            3      3
iii            3      3
beriberi       2      8
bilabial       2      8
caucasus       2      8
couscous       2      8
teammate       2      8
appall         2      6
emmett         2      6
hannah         2      6
murmur         2      6
tartar         2      6
testes         2      6
anna           2      4
coco           2      4
dada           2      4
deed           2      4
dodo           2      4
gogo           2      4
isis           2      4
juju           2      4
lulu           2      4
mimi           2      4
noon           2      4
otto           2      4
papa           2      4
peep           2      4
poop           2      4
teet           2      4
tete           2      4
toot           2      4
tutu           2      4
ii             2      2
ambidextrous   1     12
bluestocking   1     12
exclusionary   1     12
incomputable   1     12
lexicography   1     12
loudspeaking   1     12
malnourished   1     12
atmospheric    1     11
blameworthy    1     11
centrifugal    1     11
christendom    1     11
consumptive    1     11
countervail    1     11
countryside    1     11
countrywide    1     11
disturbance    1     11
documentary    1     11
earthmoving    1     11
exculpatory    1     11
geophysical    1     11
inscrutable    1     11
misanthrope    1     11
problematic    1     11
selfadjoint    1     11
stenography    1     11
sulfonamide    1     11
switchblade    1     11
switchboard    1     11
switzerland    1     11
thunderclap    1     11
valedictory    1     11
voluntarism    1     11

Quackery

  [ [] ]'[
    rot witheach
      [ dup nested
        unrot over do
        iff [ dip join ]
        else nip ]
    drop ]                        is filter  ( [ --> [ )

  [ 0 127 of
    swap witheach
      [ upper 2dup peek
        1+ unrot poke ]
    [] swap witheach
      [ dup iff join else drop ]
    dup [] = iff [ drop 0 ] done
    behead swap witheach
      [ over != if
          [ drop 0 conclude ] ] ] is isogram ( [ --> n )

  $ "rosetta/unixdict.txt" sharefile
  drop nest$ dup
  filter [ isogram 1 > ]
  sort$
  sortwith [ size dip size < ]
  sortwith [ isogram dip isogram < ]
  60 wrap$
  cr
  filter [ size 10  > ]
  filter [ isogram 1 = ]
  sort$
  sortwith [ size dip size < ]
  60 wrap$
  cr
Output:
aaa iii beriberi bilabial caucasus couscous teammate appall
emmett hannah murmur tartar testes anna coco dada deed dodo
gogo isis juju lulu mimi noon otto papa peep poop teet tete
toot tutu ii

ambidextrous bluestocking exclusionary incomputable
lexicography loudspeaking malnourished atmospheric
blameworthy centrifugal christendom consumptive countervail
countryside countrywide disturbance documentary earthmoving
exculpatory geophysical inscrutable misanthrope problematic
selfadjoint stenography sulfonamide switchblade switchboard
switzerland thunderclap valedictory voluntarism

Raku

my $file = 'unixdict.txt';

my @words = $file.IO.slurp.words.race.map: { $_ => .comb.Bag };

.say for (6...2).map: -> $n {
    next unless my @iso = @words.race.grep({.value.values.all == $n})».key;
    "\n({+@iso}) {$n}-isograms:\n" ~ @iso.sort({[-.chars, ~$_]}).join: "\n";
}

my $minchars = 10;

say "\n({+$_}) heterograms with more than $minchars characters:\n" ~
  .sort({[-.chars, ~$_]}).join: "\n" given
  @words.race.grep({.key.chars >$minchars && .value.values.max == 1})».key;
Output:
(2) 3-isograms:
aaa
iii

(31) 2-isograms:
beriberi
bilabial
caucasus
couscous
teammate
appall
emmett
hannah
murmur
tartar
testes
anna
coco
dada
deed
dodo
gogo
isis
juju
lulu
mimi
noon
otto
papa
peep
poop
teet
tete
toot
tutu
ii

(32) heterograms with more than 10 characters:
ambidextrous
bluestocking
exclusionary
incomputable
lexicography
loudspeaking
malnourished
atmospheric
blameworthy
centrifugal
christendom
consumptive
countervail
countryside
countrywide
disturbance
documentary
earthmoving
exculpatory
geophysical
inscrutable
misanthrope
problematic
selfadjoint
stenography
sulfonamide
switchblade
switchboard
switzerland
thunderclap
valedictory
voluntarism

Wren

Library: Wren-str
import "io" for File
import "./str" for Str

var isogram = Fn.new { |word|
    if (word.count == 1) return 1
    var map = {}
    word = Str.lower(word)
    for (c in word) {
        if (map.containsKey(c)) {
            map[c] = map[c] + 1
        } else {
            map[c] = 1
        }
    }
    var chars = map.keys.toList
    var n = map[chars[0]]
    var iso = chars[1..-1].all { |c| map[c] == n }
    return iso ? n : 0
}

var isoComparer = Fn.new { |i, j|
    if (i[1] != j[1]) return i[1] > j[1]
    if (i[0].count != j[0].count) return i[0].count > j[0].count
    return Str.le(i[0], j[0])
}

var heteroComparer = Fn.new { |i, j|
    if (i[0].count != j[0].count) return i[0].count > j[0].count
    return Str.le(i[0], j[0])
}

var wordList = "unixdict.txt" // local copy
var words = File.read(wordList)
                .trimEnd()
                .split("\n")
                .map { |word| [word, isogram.call(word)] }

var isograms = words.where { |t| t[1] > 1 }
                    .toList
                    .sort(isoComparer)
                    .map { |t| "  " + t[0] }
                    .toList
System.print("List of n-isograms(%(isograms.count)) where n > 1:")
System.print(isograms.join("\n"))

var heterograms = words.where { |t| t[1] == 1 && t[0].count > 10 }
                       .toList
                       .sort(heteroComparer)
                       .map { |t| "  " + t[0] }
                       .toList
System.print("\nList of heterograms(%(heterograms.count)) of length > 10:")
System.print(heterograms.join("\n"))
Output:
List of n-isograms(33) where n > 1:
  aaa
  iii
  beriberi
  bilabial
  caucasus
  couscous
  teammate
  appall
  emmett
  hannah
  murmur
  tartar
  testes
  anna
  coco
  dada
  deed
  dodo
  gogo
  isis
  juju
  lulu
  mimi
  noon
  otto
  papa
  peep
  poop
  teet
  tete
  toot
  tutu
  ii

List of heterograms(32) of length > 10:
  ambidextrous
  bluestocking
  exclusionary
  incomputable
  lexicography
  loudspeaking
  malnourished
  atmospheric
  blameworthy
  centrifugal
  christendom
  consumptive
  countervail
  countryside
  countrywide
  disturbance
  documentary
  earthmoving
  exculpatory
  geophysical
  inscrutable
  misanthrope
  problematic
  selfadjoint
  stenography
  sulfonamide
  switchblade
  switchboard
  switzerland
  thunderclap
  valedictory
  voluntarism