Random sentence from book

Revision as of 21:48, 13 February 2021

Read in the book "The War of the Worlds", by H. G. Wells.
Skip to the start of the book proper.
Remove extraneous punctuation, but keep at least the sentence-ending punctuation characters . ! and ?
Keep a count of which words follow each word and how many times each is seen (treat sentence terminators as words too).
Keep a count of which words follow each pair of words and how many times each is seen (again treating sentence terminators as words too).
Assume that a sentence starts with a full-stop character that is not shown, then use a weighted random choice of the words that may follow a full stop to add the first word to the sentence.
Then repeatedly extend the sentence with weighted random choices of the words that may follow its last two words.
Stop when the sentence ends in a sentence-ending punctuation character.
Tidy and then print the sentence.

Show examples of random sentences generated.

Julia

<lang julia>""" weighted random pick of items in a Dict{String, Int} where keys are words, values counts """
function weightedrandompick(dict, total)

   n = rand(1:total)
   for key in keys(dict)
       n -= dict[key]
       if n <= 0
           return key
       end
   end
   return last(keys(dict))

end

let

   """ Read in the book "The War of the Worlds", by H. G. Wells. """
   wotw_uri =  "http://www.gutenberg.org/files/36/36-0.txt"
   wfile = "war_of_the_worlds.txt"
   stat(wfile).size == 0 && download(wotw_uri, wfile)  # download if file not here already
   text = read(wfile, String)

   """skip to start of book and prune end """
   startphrase, endphrase = "No one would have believed", "she has counted me, among the dead"
   text = text[findfirst(startphrase, text).start:findlast(endphrase, text).stop]

   """ Remove extraneous punctuation, but keep at least sentence-ending punctuation characters . ! and ? """
   text = replace(replace(lowercase(text), r"[^01-9a-zA-Z\.\?\!\']" => " "), r"([.?!])" => s" \1")
   words = split(text, r"\s+")

   """ Keep account of what words follow words and how many times it is seen
       (treat sentence terminators as words too). Keep account of what words follow two words
       and how many times it is seen (again treating sentence terminators as words too).
   """
   follows, follows2 = Dict{String, Dict{String, Int}}(), Dict{String, Dict{String, Int}}()
   afterstop, wlen = Dict{String, Int}(), length(words)
   for (i, w) in enumerate(@view words[1:end-1])
       if !haskey(follows, w)
           follows[w] = Dict(words[i + 1] => 1)
       else
           if haskey(follows[w], words[i + 1])
               follows[w][words[i + 1]] += 1
           else
               follows[w][words[i + 1]] = 1
           end
       end
       (i > wlen - 2) && continue
       w2 = w * " " * words[i + 1]
       if !haskey(follows2, w2)
           follows2[w2] = Dict(words[i + 2] => 1)
       else
           if haskey(follows2[w2], words[i + 2])
               follows2[w2][words[i + 2]] += 1
           else
               follows2[w2][words[i + 2]] = 1
           end
       end
       if w in [".", "?", "!"]
           if !haskey(afterstop, words[i + 1])
               afterstop[words[i + 1]] = 1
           else
               afterstop[words[i + 1]] += 1
           end
       end
   end
   followsums = Dict(key => sum(values(follows[key])) for key in keys(follows))
   follow2sums = Dict(key => sum(values(follows2[key])) for key in keys(follows2))
   afterstopsum = sum(values(afterstop))

  """
  Assume that a sentence starts with a not to be shown full-stop character
  then use a weighted random choice of the possible words that may follow a
  full-stop to add to the sentence. (Here we use '.', '?', or '!' for the full stop character.)
  """
   function makesentence()
       firstword = weightedrandompick(afterstop, afterstopsum)
       sentencewords = [firstword, weightedrandompick(follows[firstword], followsums[firstword])]
       while !(sentencewords[end] in [".", "?", "!"])
           w2 = sentencewords[end-1] * " " * sentencewords[end]
           if haskey(follows2, w2)
               push!(sentencewords, weightedrandompick(follows2[w2], follow2sums[w2]))
           else
               push!(sentencewords, weightedrandompick(afterstop, afterstopsum))
           end
       end
       sentencewords[1] = uppercase(firstword[1]) * (length(firstword) > 1 ? firstword[2:end] : "")
       println(replace(join(sentencewords[1:end-1], " ") * sentencewords[end] * "\n", " i " => " I "))
   end

   makesentence(); makesentence(); makesentence()

end

</lang>

Output:

(RUN:)

It may be lying dead there!

I can imagine them covered with smashed windows and saw the flashes of flame flashed up
and saw through a culvert.

I remember how mockingly bright the sky was still doubtful it rapped smartly against the
starlight from the sun blazed dazzling in a flash I was beginning to face these things
but later I perceived a hold on me and rapidly growing hotter.

(RUN:)

A cheese in the goods yard there ploughed through shrieking people and a blue jersey.

They were heads merely heads.

I heard a faint movement under my feet.

(RUN:)

Survivors on castle hill alive but helplessly and speechlessly drunk.

Before they were killed.

The landlord should leave his.

Python

Extended to preserve some extra "sentence pausing" characters and to try to tidy up apostrophes.

<lang python>from urllib.request import urlopen
import re
from string import punctuation
from collections import Counter, defaultdict
import random

# The War of the Worlds, by H. G. Wells
text_url = 'http://www.gutenberg.org/files/36/36-0.txt'
text_start = 'No one would have believed'

sentence_ending = '.!?'
sentence_pausing = ',;:'

def read_book(text_url, text_start) -> str:

   with urlopen(text_url) as book:
       text = book.read().decode('utf-8')
   return text[text.index(text_start):]

def remove_punctuation(text: str, keep=sentence_ending+sentence_pausing) -> str:

   "Remove punctuation, keeping some"
   to_remove = ''.join(set(punctuation) - set(keep))
   text = text.translate(str.maketrans(to_remove, ' ' * len(to_remove))).strip()
   text = re.sub(fr"[^a-zA-Z0-9{keep}\n ]+", ' ', text)
   # Remove duplicates and put space around remaining punctuation
   if keep:
       text = re.sub(f"([{keep}])+", r" \1 ", text).strip()
   if text[-1] not in sentence_ending:
       text += ' .'
   return text.lower()

def word_follows_words(txt_with_pauses_and_endings):

   "return dict of freq of words following one/two words"
   words = ['.'] + txt_with_pauses_and_endings.strip().split()

   # count of what word follows this
   word2next = defaultdict(lambda :defaultdict(int))
   word2next2 = defaultdict(lambda :defaultdict(int))
   for lh, rh in zip(words, words[1:]):
       word2next[lh][rh] += 1
   for lh, mid, rh in zip(words, words[1:], words[2:]):
       word2next2[(lh, mid)][rh] += 1

   return dict(word2next), dict(word2next2)

def gen_sentence(word2next, word2next2) -> str:

   s = ['.']
   s += random.choices(*zip(*word2next[s[-1]].items()))
   while True:
       s += random.choices(*zip(*word2next2[(s[-2], s[-1])].items()))
       if s[-1] in sentence_ending:
           break

   s  = ' '.join(s[1:]).capitalize()
   s = re.sub(fr" ([{sentence_ending+sentence_pausing}])", r'\1', s)
   s = re.sub(r" re\b", "'re", s)
   s = re.sub(r" s\b", "'s", s)
   s = re.sub(r"\bi\b", "I", s)

   return s

if __name__ == "__main__":

   txt_with_pauses_and_endings = remove_punctuation(read_book(text_url, text_start))
   word2next, word2next2 = word_follows_words(txt_with_pauses_and_endings)
   #%%
   sentence = gen_sentence(word2next, word2next2)
   print(sentence)</lang>
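
The expression `random.choices(*zip(*word2next[s[-1]].items()))` in `gen_sentence` is dense, so here is a small sketch of what it unpacks to. The `counts` dictionary and the `manual_pick` helper are hypothetical, made up for illustration. `manual_pick` spells out the same weighted choice as a subtraction loop, which is the approach the Julia entry's `weightedrandompick` takes.

```python
import random
from collections import Counter

counts = {"martians": 3, "cylinder": 1}  # hypothetical follower counts

# dict.items() yields (word, count) pairs; zip(*...) transposes them into
# one tuple of words and one tuple of counts.
population, weights = zip(*counts.items())

# The second positional argument of random.choices is the weights,
# so this is the same call that gen_sentence makes.
picked = random.choices(population, weights)[0]

def manual_pick(counts, rng):
    """Same distribution written out: subtract counts until n is used up."""
    n = rng.randint(1, sum(counts.values()))
    for word, count in counts.items():
        n -= count
        if n <= 0:
            return word

rng = random.Random(0)
tally = Counter(manual_pick(counts, rng) for _ in range(4000))
print(picked, dict(tally))
```

Over many draws the tally should favour "martians" by roughly three to one, matching the counts.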

Output:

A sample of generated sentences:

As I stood petrified and staring down the river, over which spread a multitude of dogs, I flung myself forward under the night sky, a sky of gold.

He was walking through the gaps in the water.

There was no place to their intelligence, without a word they were in position there.

Ugh!

The ringing impact of trucks, the person or entity that provided you with the torrent to recover it.
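
The last sample appears to draw on the Project Gutenberg license text, which would follow from `read_book` trimming only the start of the file. A minimal sketch of trimming both ends is given here, borrowing the idea of an end phrase from the Julia entry. The helper name `trim_book` and the sample string are hypothetical, and the end phrase is assumed to occur in the downloaded text.

```python
def trim_book(text: str, start_phrase: str, end_phrase: str) -> str:
    """Keep the text from the first start_phrase through the last end_phrase."""
    start = text.index(start_phrase)                 # raises ValueError if absent
    end = text.rindex(end_phrase) + len(end_phrase)  # include the end phrase itself
    return text[start:end]

# Made-up stand-in for the downloaded file, with header and trailer text.
sample = ("HEADER No one would have believed it. "
          "The end, among the dead. LICENSE the person or entity")
trimmed = trim_book(sample, "No one would have believed", "among the dead")
print(trimmed)
```

The trimmed text stops just before the final full stop, but `remove_punctuation` already appends one when the text does not end in a sentence terminator.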