Automatic Text Generation
Automatic text generation is the generation of natural language texts by computer. It has applications in automatic documentation systems, automatic letter writing, automatic report generation, etc. In this project, we are going to generate words given a set of input words. We are going to train the LSTM model using William Shakespeare's writings. The dataset is available here.
import tensorflow as tf
import string
import requests
The get() method sends a GET request to the specified url. Here we are sending a request to get the text document of the data.
response = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt')
Displaying some part of the text returned by requests.get():
response.text[:1500]
'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with World Library, Inc., from their\nLibrary of the Future and Shakespeare CDROMS. Project Gutenberg\noften releases Etexts that are NOT placed in the Public Domain!!\n\nShakespeare\n\n*This Etext has certain copyright implications you should read!*\n\n>\n\n*Project Gutenberg is proud to cooperate with The World Library*\nin the presentation of The Complete Works of William Shakespeare\nfor your reading for education and entertainment. HOWEVER, THIS\nIS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY\nOF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY\nBE MADE FOR *ANY* ACCESS TO THIS MATERIAL. YOU ARE ENCOURAGED!!\nTO GIVE IT AWAY TO ANYONE YOU LIKE, BUT NO CHARGES ARE ALLOWED!!\n\n\n**Welcome To The World of Free Plain Vanilla Electronic Texts**\n\n**Etexts Readable By Both Humans and By Computers, Since 1971**\n\n*These Etexts Prepared By Hundreds of Volunteers and Donations*\n\nInforma'
The character \n in the text means "newline". Splitting the text with respect to \n:
data = response.text.split('\n')
data[0]
'This is the 100th Etext file presented by Project Gutenberg, and'
The text file contains a header file before the actual data begins. The actual data begins from line 253. Slicing the data to retain everything from line 253 onwards:
data = data[253:]
data[0]
' From fairest creatures we desire increase,'
The total number of lines in our data is 124204.
len(data)
124204
Right now we have a list of the lines in the data. Joining all the lines to create a long string in continuous format:
data = " ".join(data)
data[:1000]
" From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee. 2 When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep sunken eyes, Were an all"
The data contains various punctuation marks. The function clean_text() removes all the punctuation marks and special characters.
Splitting the data according to space character and separating each word using split().
maketrans() function is used to construct the transition table, specifying the list of characters that need to be replaced in the whole string or the characters that need to be deleted from the string. The first parameter specifies the list of characters that need to be replaced, the second parameter specifies the list of characters with which the characters need to be replaced, the third parameter specifies the list of characters that needs to be deleted. It returns the translation table which specifies the conversions that can be used by translate().
string.punctuation is a pre-initialized string used as string constant which will give all the sets of punctuation.
To translate the characters in the string, translate() makes the translations using the translation mapping specified by maketrans().
The isalpha() method returns True if all the characters are alphabet letters (a-z). The lower() method returns the lowercased string from the given string.
After passing data to clean_text() the data is in the required format without punctuations and special characters.
def clean_text(doc):
tokens = doc.split()
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
tokens = [word for word in tokens if word.isalpha()]
tokens = [word.lower() for word in tokens]
return tokens
tokens = clean_text(data)
print(tokens[:50])
['from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', 'his', 'tender', 'heir', 'might', 'bear', 'his', 'memory', 'but', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', 'feedst', 'thy', 'lights', 'flame', 'with', 'selfsubstantial', 'fuel', 'making', 'a', 'famine', 'where', 'abundance', 'lies', 'thy']
The total number of words are 898199.
len(tokens)
898199
The total number of unique words are 27956.
len(set(tokens))
27956
A set of 50 previous words predicts the 51st word in each sentence. The data is divided in chunks of 51 words, with the last word separated from every line. The dataset is limited to 200000 words.
length = 50 + 1
lines = []
for i in range(length, len(tokens)):
seq = tokens[i-length:i]
line = ' '.join(seq)
lines.append(line)
if i > 200000:
break
print(len(lines))
199951
The first line consisting of 51 words:
lines[0]
'from fairest creatures we desire increase that thereby beautys rose might never die but as the riper should by time decease his tender heir might bear his memory but thou contracted to thine own bright eyes feedst thy lights flame with selfsubstantial fuel making a famine where abundance lies thy self'
The 51st word in this line is 'self' which will the output word used for prediction.
tokens[50]
'self'
This is the second line consisting of 51 words. Hopping by one word, the 51st word in this line is 'thy', which is the output word used for prediction.
lines[1]
'fairest creatures we desire increase that thereby beautys rose might never die but as the riper should by time decease his tender heir might bear his memory but thou contracted to thine own bright eyes feedst thy lights flame with selfsubstantial fuel making a famine where abundance lies thy self thy'
Build LSTM Model and Prepare X and y
All the necessary libraries for pre-processing the data and creating neural network layers:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
A unique numerical token is created for each unique word in the dataset. fit_on_texts() updates internal vocabulary based on a list of texts. texts_to_sequences() transforms each text in texts to a sequence of integers.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
sequences containes a list of integer values created by tokenizer. Each line in sequences has 51 words. Each line is then split so the first 50 words go into X and the last word into y.
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:,-1]
X[0]
array([ 47, 1408, 1264, 37, 451, 1406, 9, 2766, 1158,
1213, 171, 132, 269, 20, 24, 1, 4782, 87,
30, 98, 4781, 18, 715, 1263, 171, 211, 18,
829, 20, 27, 3807, 4, 214, 121, 1212, 153,
13004, 31, 2765, 1847, 16, 13003, 13002, 754, 7,
3806, 99, 2430, 466, 31])
vocab_size contains all the unique words in the dataset. tokenizer.word_index gives the mapping of each unique word to its numerical equivalent. Hence len() of tokenizer.word_index gives the vocab_size.
vocab_size = len(tokenizer.word_index) + 1
to_categorical() converts a class vector (integers) to binary class matrix. num_classes is the total number of classes which is vocab_size.
y = to_categorical(y, num_classes=vocab_size)
The length of each sequence in X is 50.
seq_length = X.shape[1]
seq_length
50
LSTM Model
A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
Embedding layer:
The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
It requires 3 arguments:
input_dim: This is the size of the vocabulary in the text data which isvocab_sizein this case.output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word.input_length: Length of input sequences which isseq_length.
LSTM layer:
This is the main layer of the model. It learns long-term dependencies between time steps in time series and sequence data. return_sequence when set to True returns the full sequence as the output.
Dense layer:
Dense layer is the regular deeply connected neural network layer. It is the most common and frequently used layer. The rectified linear activation function or relu for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.

The last layer is also a dense layer with 13009 neurons because we have to predict the probabilties of 13009 words. The activation function used is softmax. Softmax converts a real vector to a vector of categorical probabilities. The elements of the output vector are in range (0, 1) and sum to 1.

model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 50, 50) 650450
_________________________________________________________________
lstm (LSTM) (None, 50, 100) 60400
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 80400
_________________________________________________________________
dense (Dense) (None, 100) 10100
_________________________________________________________________
dense_1 (Dense) (None, 13009) 1313909
=================================================================
Total params: 2,115,259
Trainable params: 2,115,259
Non-trainable params: 0
_________________________________________________________________
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
After compiling the model, training uses model.fit() on the training dataset with 100 epochs. An epoch is an iteration over the entire x and y data provided. batch_size is the number of samples per gradient update, so the weights are updated after every 256 training examples.
model.fit(X, y, batch_size = 256, epochs = 100)
Epoch 95/100
199951/199951 [==============================] - 21s 103us/sample - loss: 2.4903 - accuracy: 0.4476
Epoch 96/100
199951/199951 [==============================] - 21s 104us/sample - loss: 2.4770 - accuracy: 0.4497
Epoch 97/100
199951/199951 [==============================] - 21s 106us/sample - loss: 2.4643 - accuracy: 0.4522
Epoch 98/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4519 - accuracy: 0.4530
Epoch 99/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4341 - accuracy: 0.4562
Epoch 100/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4204 - accuracy: 0.4603
Generating words using the model requires a set of 50 words to predict the 51st word. A random line from the dataset is used as a starting point:
seed_text=lines[12343]
seed_text
'home of love if i have ranged like him that travels i return again just to the time not with the time exchanged so that my self bring water for my stain never believe though in my nature reigned all frailties that besiege all kinds of blood that it could so'
generate_text_seq() generates n_words number of words after the given seed_text. The seed_text is pre-processed before predicting. Encoding uses the same encoding used for the training data, then pad_sequences() converts the seed_text to 50 words. After calling model.predict_classes(), the word is looked up in tokenizer using the index in y_predict. Finally the predicted word is appended to seed_text and text, and the process repeats.
def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):
text = []
for _ in range(n_words):
encoded = tokenizer.texts_to_sequences([seed_text])[0]
encoded = pad_sequences([encoded], maxlen = text_seq_length, truncating='pre')
y_predict = model.predict_classes(encoded)
predicted_word = ''
for word, index in tokenizer.word_index.items():
if index == y_predict:
predicted_word = word
break
seed_text = seed_text + ' ' + predicted_word
text.append(predicted_word)
return ' '.join(text)
The next 100 words predicted by the model for the seed_text:
generate_text_seq(model, tokenizer, seq_length, seed_text, 100)
'preposterously be stained to leave for them when thou art covetous and we pour their natural fortune grace the other fool as a monkey i cannot weep and ends you clown nay ay my lord ham aside the queen quoth thou into sport angelo to th capitol brutus patience peace night i am returnd to th field o th gout and a man that murdred of the greatest rout of her particular occasions that outruns the world side corioli did my performance gone cut the offender and take thee to you that dare not go and show me to be'
The model reaches 46% accuracy. To increase this, train on more epochs or consider the entire dataset. This model trained on only 1/4th of the available data.
Conclusion
In this tutorial you trained a stacked two-layer LSTM on 200,000 words from Shakespeare's complete works to generate new text word-by-word. After cleaning punctuation, building 51-word sliding sequences, and training for 100 epochs with categorical_crossentropy, the model reached 46% training accuracy and generated stylistically coherent Shakespearean prose such as "patience peace night i am returnd to th field."
Key takeaways:
- Using 50 context words (rather than 1 or 2) gives the LSTM enough temporal context to capture phrase-level patterns in long literary prose, improving coherence of generated output.
pad_sequenceswithtruncating='pre'discards the oldest tokens when the seed text grows beyondseq_length, keeping the most recent context window intact for prediction.- 46% word-level accuracy on a 13,009-word vocabulary is far above random chance (0.008%); to increase it further, train on the full dataset (898k tokens) and extend training to 200+ epochs.
model.predict_classes()returns the argmax token (greedy decoding); replacing it with top-k sampling at inference time produces more varied and creative output.
Next steps:
- Compare word-level generation with character-level generation in Poetry Generation using TensorFlow, Keras and LSTM to see the fluency vs. creativity trade-off.
- Replace the trained Embedding layer with pre-trained GloVe vectors to improve word representations on a limited corpus. See Words Embedding using GloVe Vectors.
- Fine-tune a GPT-2 or GPT-2-small model on the Shakespeare corpus for a modern baseline comparison.