Random sentence from book

From Rosetta Code
Revision as of 19:06, 12 February 2021 by rosettacode>Paddy3118 (New draft task with Python solution)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Random sentence from book is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
  • Read in the book "The War of the Worlds", by H. G. Wells.
  • Skip to the start of the book, proper.
  • Remove extraneous punctuation, but keep at least sentence-ending punctuation characters . ! and ?
  • Keep account of what words follow words and how many times it is seen, (treat sentence terminators as words too).
  • Keep account of what words follow two words and how many times it is seen, (again treating sentence terminators as words too).
  • Assume that a sentence starts with a not to be shown full-stop character then use a weighted random choice of the possible words that may follow a full-stop to add to the sentence.
  • Then repeatedly add words to the sentence based on weighted random choices of what words my follow the last two words to extend the sentence.
  • Stop when the sentence ends in a sentence ending punctuation character.
  • Tidy and then print the sentence.

Show examples of random sentences generated.

Python

Extended to preserve some extra "sentence pausing" characters and try and tidy-up apostrophes.

<lang python>from urllib.request import urlopen import re from string import punctuation from collections import Counter, defaultdict import random


  1. The War of the Worlds, by H. G. Wells

text_url = 'http://www.gutenberg.org/files/36/36-0.txt' text_start = 'No one would have believed'

sentence_ending = '.!?' sentence_pausing = ',;:'

def read_book(text_url, text_start) -> str:

   with urlopen(text_url) as book:
       text = book.read().decode('utf-8')
   return text[text.index(text_start):]

def remove_punctuation(text: str, keep=sentence_ending+sentence_pausing)-> str:

   "Remove punctuation, keeping some"
   to_remove = .join(set(punctuation) - set(keep))
   text = text.translate(str.maketrans(to_remove, ' ' * len(to_remove))).strip()
   text = re.sub(fr"[^a-zA-Z0-9{keep}\n ]+", ' ', text)
   # Remove duplicates and put space around remaining punctuation
   if keep:
       text = re.sub(f"([{keep}])+", r" \1 ", text).strip()
   if text[-1] not in sentence_ending:
       text += ' .'
   return text.lower()

def word_follows_words(txt_with_pauses_and_endings):

   "return dict of freq of words following one/two words"
   words = ['.'] + txt_with_pauses_and_endings.strip().split()
   # count of what word follows this
   word2next = defaultdict(lambda :defaultdict(int))
   word2next2 = defaultdict(lambda :defaultdict(int))
   for lh, rh in zip(words, words[1:]):
       word2next[lh][rh] += 1
   for lh, mid, rh in zip(words, words[1:], words[2:]):
       word2next2[(lh, mid)][rh] += 1
   return dict(word2next), dict(word2next2)

def gen_sentence(word2next, word2next2) -> str:

   s = ['.']
   s += random.choices(*zip(*word2next[s[-1]].items()))
   while True:
       s += random.choices(*zip(*word2next2[(s[-2], s[-1])].items()))
       if s[-1] in sentence_ending:
           break
   s  = ' '.join(s[1:]).capitalize()
   s = re.sub(fr" ([{sentence_ending+sentence_pausing}])", r'\1', s)
   s = re.sub(r" re\b", "'re", s)
   s = re.sub(r" s\b", "'s", s)
   s = re.sub(r"\bi\b", "I", s)
   return s

if __name__ == "__main__":

   txt_with_pauses_and_endings = remove_punctuation(read_book(text_url, text_start))
   word2next, word2next2 = word_follows_words(txt_with_pauses_and_endings)
   #%%
   sentence = gen_sentence(word2next, word2next2)
   print(sentence)</lang>
Output:
<# A SAMPLE OF GENERATED SENTENCES

As I stood petrified and staring down the river, over which spread a multitude of dogs, I flung myself forward under the night sky, a sky of gold.

He was walking through the gaps in the water.

There was no place to their intelligence, without a word they were in position there.

Ugh!

The ringing impact of trucks, the person or entity that provided you with the torrent to recover it.
/pre>