Shannon's Linguistic Playground

By Stefan Szeider

The Shannon Text Generator is an interactive demonstration of Claude Shannon's groundbreaking concepts from his 1948 paper “A Mathematical Theory of Communication.” Shannon's work with Markov processes laid the foundation for modern natural language processing and generative AI. While today's transformer models like ChatGPT have evolved far beyond these simple statistical methods, they still build upon Shannon's fundamental insights about information entropy and probabilistic language modeling. This tool lets you experiment with variable-order Markov chains to generate text that statistically resembles its source material—a glimpse into the core principles that eventually led to today's sophisticated language models.

Interactive Experiments

This interactive web application implements Shannon's idea of using a Markov process to generate text that statistically resembles its source. The generator analyzes the input text by building a map from character sequences to the characters that can follow them. With a Markov order (k) of 3, for example, it examines every three-character sequence in the text and records which characters follow each occurrence. Generation starts from a seed sequence (either randomly selected or provided) and then repeatedly samples the next character according to the frequencies observed in the source, shifting the k-character window forward by one position with each new character. The result is a chain of probabilistically related characters that reproduces the statistical properties of the original text.
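
A minimal sketch of this procedure in TypeScript (the function names and details below are illustrative assumptions, not taken from the tool's actual source):

    // Build a Markov model of order k: map each k-character context to
    // the list of characters that follow it somewhere in the source text.
    function buildModel(text: string, k: number): Map<string, string[]> {
      const model = new Map<string, string[]>();
      for (let i = 0; i + k < text.length; i++) {
        const context = text.slice(i, i + k);
        const next = text[i + k];
        if (!model.has(context)) model.set(context, []);
        model.get(context)!.push(next);
      }
      return model;
    }

    // Generate `length` characters from a k-character seed, repeatedly
    // sampling a successor and shifting the context window forward.
    function generate(model: Map<string, string[]>, seed: string, length: number): string {
      let output = seed;
      let context = seed;
      for (let i = 0; i < length; i++) {
        const successors = model.get(context);
        if (!successors) break; // dead end: this context never occurs in the source
        const next = successors[Math.floor(Math.random() * successors.length)];
        output += next;
        context = (context + next).slice(1); // drop the oldest character
      }
      return output;
    }

    // Example usage, assuming `source` holds the full input text:
    const source = "Once upon a time there was a little girl...";
    const k = 3;
    const model = buildModel(source, k);
    const seed = source.slice(0, k); // or any k-character sequence from the source
    console.log(generate(model, seed, 200));

Because every occurrence of a successor is stored, characters that follow a given context more often in the source are sampled proportionally more often, which is exactly Shannon's frequency-based selection.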

Example 1

Let's take the text of Little Red Riding Hood.

Unigram (k=1): Hop Hoed Rig, Gof Alfr HOOOOD OOnoood he cl," "Whthe thend, Ricry d Shodorer." ITLI wid y Re t sat tlood sme, "Wod kerandmod angrs Litll Rith s ithe, Alys s yoe u t in wid dmoosche peasthe he atersthaf m y idmod?

Bigram (k=2): All the he wice Riding Hood?" In and Red who shen spre litter alk a pas pas then Mr. He she Rid flothers of the mad caloaked Rid.

Trigram (k=3): Everybody lived Riding Hood. In andmothe bed Riding to see you goings. Once and cake and crept under mout in, so picked ther.

4-gram (k=4): Four Everybody loved had dropped at the village. "Little Red Riding Hood's father lay in throught and gave brough then the grandmother grandmother.

5-gram (k=5): At last he reached this basket, Little Red Riding Hood?" "I am going with a hood was so happily ever after.

Example 2

Let's take the most popular baby names in Austria in 2010 (format: boys / girls).

Unigram (k=1): Lon, Do, Pamobantieons, Liavid, Chameonuiaur, Chan, Olikterchas, Hast, Stz, Dat, Hel, Erdorichian, Lested, Anjan / Soph, Thilerina, Ma, Misa, Sohisa, Ole, Nia, Gria, Aliare, Ela, Idra, Che, Miesa, Soh, Olianna, Ma, Ma, Minzia, Lela.

Bigram (k=2): Luca, Antincenry, Marco, Dammed, Arthael, Lian, Louistias, Hugo, Car, Domastian, Chritz, Man, Sebas, Sebas, Fabias / Sophie, Sophie, Melia, Marah, Viktonie, Julia, Lena, Vanna, Johanna, Daria, Sophie, Leonia, Eda, Rosa, Antorie.

Trigram (k=3): Lukas, Matthias, Niklas, Henry, Manuel, Liam, Simon, Jonardo, Oliver, Felix, Julian, Elian, Tim, Hugo, Damian, Erik / Sophie, Hannah, Clara, Melisa, Melina, Amelia, Sophie, Isabeth, Viktoria, Hanna, Stella, Pia, Melina, Lisabeth.