Talk:Random sentence from book

Markov chain?

Isn't this pretty much exactly the Markov chain text generator task, other than being run on War of the Worlds rather than Alice in Wonderland and Wizard of Oz? The instructions are almost exactly the same; this task just omits mentioning that it is Markov chaining. --Thundergnat (talk) 19:46, 12 February 2021 (UTC)

I didn't remember that when creating this task. There is however a notable difference:
"... use a weighted random choice ..."
In this task, words that more often follow a given word in the text should be more likely to be chosen next. These weights need to be accumulated and applied in this task, whereas that task does not require it.
--Paddy3118 (talk) 10:56, 15 February 2021 (UTC)
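
The distinction drawn above can be sketched on a toy word list. This is only an illustration, not the code from either task page: the helper names <code>build_followers</code>, <code>weighted_next</code> and <code>uniform_next</code> are hypothetical, and the sample text is made up. It accumulates follower counts, then contrasts a choice weighted by those counts with a plain choice among the distinct followers.

```python
import random
from collections import Counter, defaultdict


def build_followers(words):
    """Map each word to a Counter of the words seen directly after it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def weighted_next(followers, word, rng=random):
    """Pick a follower of `word` with probability proportional to its count."""
    candidates, weights = zip(*followers[word].items())
    return rng.choices(candidates, weights=weights)[0]


def uniform_next(followers, word, rng=random):
    """Pick one of the distinct followers of `word`, ignoring the counts."""
    return rng.choice(list(followers[word]))


# Made-up sample text: 'cat' follows 'the' twice, 'dog' follows it once.
sample = "the cat sat . the cat ran . the dog sat .".split()
followers = build_followers(sample)
```

With the weighted choice, 'cat' should come up after 'the' about twice as often as 'dog'; with the uniform choice the two are equally likely.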

Some stats

I was thinking of extending the Python example to weight words by the two, three, ... words preceding them, but after a while I realised that this would constrain things so much that you would only generate sentences that are actually in the book!
I decided instead to find out how many of the random sentences generated by the current Python code exist in the book, by appending this snippet: <lang python>#%% Sentence counts

def gen_simple_sentence(word2next=word2next, word2next2=word2next2) -> str:
   "Generate one sentence; the generated word sequence is not tidied up."
   s = ['.']
   s += random.choices(*zip(*word2next[s[-1]].items()))
   while True:
       s += random.choices(*zip(*word2next2[(s[-2], s[-1])].items()))
       if s[-1] in sentence_ending:
           break
   return ' '.join(s[1:])


if 1:
   N = 1_000
   words = ['.'] + txt_with_pauses_and_endings.strip().split()
   sent_count  = sum(words.count(punct) for punct in sentence_ending) - 1
   pause_count = sum(words.count(punct) for punct in sentence_pausing)
   avg_words_in_sent = (len(words) - 1 - pause_count
                        - words.count('re') - words.count('s')) / sent_count
   print(f'\nSentences in the book have ~{avg_words_in_sent:.1f} words on average')
   book = ' '.join(words)  # Now sanitised
   copies = sum(gen_simple_sentence() in book for _ in range(N))
   print(f"Generating {N:_} random sentences produced {copies:_}"
         " that are actually in the book")</lang>
The average sentence length in the book is approx. 19 words, and around 15% of the generated sentences actually occur in the book.
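
The concern about longer contexts can be explored with a small sketch. It rests on several assumptions: it is not the task's Python solution, the names <code>build_model</code>, <code>generate</code> and <code>copy_rate</code> are hypothetical, the three-sentence corpus is made up, and it compares whole sentences rather than doing the substring test used in the snippet above. Each sentence is padded with start markers so that any context length can be used.

```python
import random
from collections import Counter, defaultdict

START, END = '<s>', '.'


def build_model(sentences, order):
    """Map each context of `order` words to a Counter of the words that follow."""
    model = defaultdict(Counter)
    for sent in sentences:
        padded = [START] * order + sent
        for i in range(len(sent)):
            model[tuple(padded[i:i + order])][padded[i + order]] += 1
    return model


def generate(model, order, rng):
    """Generate one sentence by weighted choice until END is produced."""
    out = [START] * order
    while out[-1] != END:
        candidates, weights = zip(*model[tuple(out[-order:])].items())
        out += rng.choices(candidates, weights=weights)
    return out[order:]


def copy_rate(sentences, order, n, seed=0):
    """Fraction of n generated sentences that are identical to a source sentence."""
    rng = random.Random(seed)
    model = build_model(sentences, order)
    originals = {tuple(s) for s in sentences}
    return sum(tuple(generate(model, order, rng)) in originals
               for _ in range(n)) / n


# Made-up corpus; every sentence ends with END and contains it nowhere else.
corpus = [s.split() for s in ("the cat sat on the mat .",
                              "the dog sat on the rug .",
                              "the cat ran to the dog .")]

for order in (1, 2, 3, 7):
    print(order, copy_rate(corpus, order, 500))
```

Once the context is as long as the longest sentence, every generated sentence is necessarily a copy of a source sentence, which is the constraint described above; shorter contexts leave room for new sentences.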