Text Generation using Tensorflow, Keras and LSTM

Generate Shakespearean text using stacked LSTM in TensorFlow. Covers corpus cleaning, tokenization, sequence preparation, Embedding layer, and word prediction.

Aug 31, 2020 | Updated May 23, 2026 | 29 min read

Topics You Will Master

Raw corpus loading: Shakespeare text cleaning and normalization
Tokenization and integer encoding of the full vocabulary
Fixed-length sequence preparation for next-word prediction training
Stacked LSTM with Embedding layer for language modeling

Automatic Text Generation

Automatic text generation is the generation of natural-language text by a computer. It has applications in automatic documentation systems, letter writing, report generation, and similar tasks. In this project, we generate words given a set of input words by training an LSTM model on William Shakespeare's writings. The dataset is a plain-text file that is downloaded in the code below.

PYTHON
import tensorflow as tf
import string
import requests

The get() method sends a GET request to the specified url. Here we are sending a request to get the text document of the data.

PYTHON
response = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt')

Displaying some part of the text returned by requests.get():

PYTHON
response.text[:1500]
OUTPUT
'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with World Library, Inc., from their\nLibrary of the Future and Shakespeare CDROMS.  Project Gutenberg\noften releases Etexts that are NOT placed in the Public Domain!!\n\nShakespeare\n\n*This Etext has certain copyright implications you should read!*\n\n>\n\n*Project Gutenberg is proud to cooperate with The World Library*\nin the presentation of The Complete Works of William Shakespeare\nfor your reading for education and entertainment.  HOWEVER, THIS\nIS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY\nOF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY\nBE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!!\nTO GIVE IT AWAY TO ANYONE YOU LIKE, BUT NO CHARGES ARE ALLOWED!!\n\n\n**Welcome To The World of Free Plain Vanilla Electronic Texts**\n\n**Etexts Readable By Both Humans and By Computers, Since 1971**\n\n*These Etexts Prepared By Hundreds of Volunteers and Donations*\n\nInforma'

The character \n in the text means "newline". Splitting the text with respect to \n:

PYTHON
data = response.text.split('\n')
data[0]
OUTPUT
'This is the 100th Etext file presented by Project Gutenberg, and'

The text file contains a Project Gutenberg header before the actual data begins. The actual text starts at index 253 of the list, so we slice the data to keep everything from there onwards:

PYTHON
data = data[253:]
data[0]
OUTPUT
'  From fairest creatures we desire increase,'

The total number of lines in our data is 124204.

PYTHON
len(data)
OUTPUT
124204

Right now we have a list of the lines in the data. Joining all the lines to create a long string in continuous format:

PYTHON
data = " ".join(data)
data[:1000]
OUTPUT
"  From fairest creatures we desire increase,   That thereby beauty's rose might never die,   But as the riper should by time decease,   His tender heir might bear his memory:   But thou contracted to thine own bright eyes,   Feed'st thy light's flame with self-substantial fuel,   Making a famine where abundance lies,   Thy self thy foe, to thy sweet self too cruel:   Thou that art now the world's fresh ornament,   And only herald to the gaudy spring,   Within thine own bud buriest thy content,   And tender churl mak'st waste in niggarding:     Pity the world, or else this glutton be,     To eat the world's due, by the grave and thee.                        2   When forty winters shall besiege thy brow,   And dig deep trenches in thy beauty's field,   Thy youth's proud livery so gazed on now,   Will be a tattered weed of small worth held:     Then being asked, where all thy beauty lies,   Where all the treasure of thy lusty days;   To say within thine own deep sunken eyes,   Were an all"

The data contains various punctuation marks. The function clean_text() removes all punctuation marks and special characters and returns a list of lowercase words.

split() with no arguments splits the data on any run of whitespace, separating it into individual words.

str.maketrans() constructs the translation table used by translate(). Its first parameter lists the characters to replace, the second lists the characters to replace them with, and the third lists the characters to delete. Here the first two are empty strings, so the table only deletes the characters in string.punctuation.

string.punctuation is a predefined string constant containing the ASCII punctuation characters.

translate() applies the translation table built by maketrans() to each word.

The isalpha() method returns True if the string is non-empty and every character is a letter, so words that still contain digits or other symbols are dropped. The lower() method returns the lowercased string.

After passing data to clean_text(), the data is in the required format, without punctuation or special characters.

PYTHON
def clean_text(doc):
  tokens = doc.split()
  table = str.maketrans('', '', string.punctuation)
  tokens = [w.translate(table) for w in tokens]
  tokens = [word for word in tokens if word.isalpha()]
  tokens = [word.lower() for word in tokens]
  return tokens

tokens = clean_text(data)
print(tokens[:50])
OUTPUT
['from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', 'his', 'tender', 'heir', 'might', 'bear', 'his', 'memory', 'but', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', 'feedst', 'thy', 'lights', 'flame', 'with', 'selfsubstantial', 'fuel', 'making', 'a', 'famine', 'where', 'abundance', 'lies', 'thy']
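To see what each step of clean_text() does, here is a minimal sketch that applies the same calls to a short sample string. The sample sentence and the variable names are made up for illustration and are not part of the corpus.

```python
import string

# Hypothetical sample text, not taken from the corpus
sample = "Thou art 2 kind, self-substantial & fair!"

words = sample.split()                              # split on whitespace
table = str.maketrans('', '', string.punctuation)   # delete-only translation table
stripped = [w.translate(table) for w in words]      # remove punctuation characters
alpha = [w for w in stripped if w.isalpha()]        # drop digits and empty strings
cleaned = [w.lower() for w in alpha]

print(stripped)
print(cleaned)
```

Note that the hyphenated word collapses into a single token, just as 'selfsubstantial' does in the output above, and that the standalone '&' becomes an empty string which isalpha() then filters out.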

The total number of words is 898199.

PYTHON
len(tokens)
OUTPUT
898199

The total number of unique words is 27956.

PYTHON
len(set(tokens))
OUTPUT
27956

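The previous 50 words are used to predict the 51st word. The data is divided into overlapping chunks of 51 words, each shifted by one word from the previous chunk; the last word of every chunk is later separated out as the prediction target. To limit training time, only roughly the first 200000 words are used.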

PYTHON
length = 50 + 1
lines = []

for i in range(length, len(tokens)):
  seq = tokens[i-length:i]
  line = ' '.join(seq)
  lines.append(line)
  if i > 200000:
    break

print(len(lines))
OUTPUT
199951

The first line consisting of 51 words:

PYTHON
lines[0]
OUTPUT
'from fairest creatures we desire increase that thereby beautys rose might never die but as the riper should by time decease his tender heir might bear his memory but thou contracted to thine own bright eyes feedst thy lights flame with selfsubstantial fuel making a famine where abundance lies thy self'

The 51st word in this line is 'self', which will be the output word used for prediction.

PYTHON
tokens[50]
OUTPUT
'self'

This is the second line consisting of 51 words. Hopping by one word, the 51st word in this line is 'thy', which is the output word used for prediction.

PYTHON
lines[1]
OUTPUT
'fairest creatures we desire increase that thereby beautys rose might never die but as the riper should by time decease his tender heir might bear his memory but thou contracted to thine own bright eyes feedst thy lights flame with selfsubstantial fuel making a famine where abundance lies thy self thy'
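The sliding-window construction is easier to follow on a toy example. The sketch below uses a context of 3 words instead of 50 and splits each window into its context and target; the variable names are illustrative only. It also checks the line count reported above, given that the loop index runs from 51 to 200001 inclusive before the break.

```python
toy_tokens = "from fairest creatures we desire increase that thereby".split()
length = 3 + 1   # 3 context words plus 1 target word

# One window per position, each shifted by one word.
# (This sketch also includes the final window of the toy list.)
windows = [toy_tokens[i - length:i] for i in range(length, len(toy_tokens) + 1)]
pairs = [(w[:-1], w[-1]) for w in windows]
for context, target in pairs:
    print(context, '->', target)

# Count check for the tutorial's loop: i takes the values 51, 52, ..., 200001
n_lines = 200001 - 51 + 1
print(n_lines)
```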

Build LSTM Model and Prepare X and y

All the necessary libraries for pre-processing the data and creating neural network layers:

PYTHON
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

A unique numerical token is created for each unique word in the dataset. fit_on_texts() updates internal vocabulary based on a list of texts. texts_to_sequences() transforms each text in texts to a sequence of integers.

PYTHON
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
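As a rough mental model of what the Tokenizer does here, the pure-Python sketch below assigns integer indices by descending word frequency, starting at 1. It is an illustration only, not the Keras implementation; the helper names build_word_index and encode are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an index: 1 for the most frequent word, then 2, and so on."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_lines = ["to be or not to be", "to thine own self be true"]
word_index = build_word_index(toy_lines)
print(word_index)
print(encode(toy_lines, word_index))
```

Index 0 is never assigned to a word in this sketch, which mirrors the convention that 0 is kept free for padding.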

sequences contains a list of integer sequences created by the tokenizer. Each line in sequences has 51 integers, one per word. Each line is then split so that the first 50 values go into X and the last one into y.

PYTHON
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:,-1]
X[0]
OUTPUT
array([   47,  1408,  1264,    37,   451,  1406,     9,  2766,  1158,
        1213,   171,   132,   269,    20,    24,     1,  4782,    87,
          30,    98,  4781,    18,   715,  1263,   171,   211,    18,
         829,    20,    27,  3807,     4,   214,   121,  1212,   153,
       13004,    31,  2765,  1847,    16, 13003, 13002,   754,     7,
        3806,    99,  2430,   466,    31])

vocab_size is the number of unique words in the dataset plus one. tokenizer.word_index maps each unique word to its integer index, starting at 1; index 0 is reserved for padding. Hence vocab_size is len(tokenizer.word_index) + 1.

PYTHON
vocab_size = len(tokenizer.word_index) + 1

to_categorical() converts a class vector (integers) to binary class matrix. num_classes is the total number of classes which is vocab_size.

PYTHON
y = to_categorical(y, num_classes=vocab_size)
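For intuition, one-hot encoding can be sketched in plain Python, together with a back-of-the-envelope estimate of how large the resulting y matrix is for this tutorial's 199951 samples and 13009 classes. The figure of 4 bytes per value assumes 32-bit floats, which is an assumption about the dtype rather than something stated above; one_hot is a hypothetical helper.

```python
def one_hot(indices, num_classes):
    """Turn each integer class into a row containing a single 1."""
    rows = []
    for idx in indices:
        row = [0] * num_classes
        row[idx] = 1
        rows.append(row)
    return rows

encoded_rows = one_hot([2, 0, 3], num_classes=4)
print(encoded_rows)

# Rough size of y in this tutorial, assuming 4 bytes per value
n_samples, vocab = 199951, 13009
approx_gib = n_samples * vocab * 4 / 2**30
print(round(approx_gib, 1), "GiB")
```

If that is too much memory for your machine, one option is to keep y as integers and compile the model with the sparse_categorical_crossentropy loss, which Keras provides for integer targets.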

The length of each sequence in X is 50.

PYTHON
seq_length = X.shape[1]
seq_length
OUTPUT
50

LSTM Model

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

Embedding layer:

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is configured here with three arguments:

  • input_dim: This is the size of the vocabulary in the text data which is vocab_size in this case.
  • output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word.
  • input_length: Length of input sequences which is seq_length.

LSTM layer:

This is the main layer of the model. It learns long-term dependencies between time steps in time series and sequence data. Setting return_sequences=True makes the layer return its output for every time step rather than only the last one, which the second stacked LSTM layer needs as its input.

Dense layer:

Dense layer is the regular deeply connected neural network layer. It is the most common and frequently used layer. The rectified linear activation function or relu for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.

ReLU activation function graph showing zero output for negative inputs and linear output for positive values

The last layer is also a dense layer, with 13009 neurons (vocab_size), because we have to predict a probability for each of the 13009 possible words. The activation function used is softmax. Softmax converts a real vector to a vector of categorical probabilities. The elements of the output vector are in the range (0, 1) and sum to 1.
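Both activation functions are small enough to write out by hand. The following is a minimal plain-Python sketch for illustration; Keras applies its own vectorized versions inside the layers.

```python
import math

def relu(x):
    """Return x if it is positive, otherwise zero."""
    return max(0.0, x)

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

relu_out = [relu(x) for x in (-2.0, 0.0, 3.5)]
probs = softmax([2.0, 1.0, 0.1])
print(relu_out)
print(probs, sum(probs))
```

The largest score receives the largest probability, which is why taking the argmax of the softmax output picks the most likely next word.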

Sigmoid activation function graph showing the S-shaped curve mapping any input to a value between 0 and 1

PYTHON
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
PYTHON
model.summary()
OUTPUT
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 50, 50)            650450
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400
_________________________________________________________________
dense (Dense)                (None, 100)               10100
_________________________________________________________________
dense_1 (Dense)              (None, 13009)             1313909
=================================================================
Total params: 2,115,259
Trainable params: 2,115,259
Non-trainable params: 0
_________________________________________________________________
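The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and Dense layers; the helper names are hypothetical.

```python
vocab_size, embed_dim, units, dense_units = 13009, 50, 100, 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

counts = {
    'embedding': vocab_size * embed_dim,
    'lstm': lstm_params(embed_dim, units),
    'lstm_1': lstm_params(units, units),
    'dense': dense_params(units, dense_units),
    'dense_1': dense_params(dense_units, vocab_size),
}
print(counts)
print(sum(counts.values()))
```

More than half of the parameters sit in the final Dense layer, because it needs one weight vector per vocabulary word.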
PYTHON
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

After compiling the model, we train it with model.fit() for 100 epochs. An epoch is one pass over the entire X and y data provided. batch_size is the number of samples per gradient update, so the weights are updated after every 256 training examples.

PYTHON
model.fit(X, y, batch_size = 256, epochs = 100)
OUTPUT
Epoch 95/100
199951/199951 [==============================] - 21s 103us/sample - loss: 2.4903 - accuracy: 0.4476
Epoch 96/100
199951/199951 [==============================] - 21s 104us/sample - loss: 2.4770 - accuracy: 0.4497
Epoch 97/100
199951/199951 [==============================] - 21s 106us/sample - loss: 2.4643 - accuracy: 0.4522
Epoch 98/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4519 - accuracy: 0.4530
Epoch 99/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4341 - accuracy: 0.4562
Epoch 100/100
199951/199951 [==============================] - 21s 105us/sample - loss: 2.4204 - accuracy: 0.4603
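The relationship between dataset size, batch size, and the number of weight updates is simple arithmetic, sketched here with the numbers from this run.

```python
import math

n_samples, batch_size, epochs = 199951, 256, 100

steps_per_epoch = math.ceil(n_samples / batch_size)          # gradient updates per epoch
last_batch = n_samples - (steps_per_epoch - 1) * batch_size  # size of the final, partial batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```

A larger batch_size means fewer, smoother updates per epoch; a smaller one means more frequent but noisier updates.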

Generating words using the model requires a set of 50 words to predict the 51st word. A random line from the dataset is used as a starting point:

PYTHON
seed_text=lines[12343]
seed_text
OUTPUT
'home of love if i have ranged like him that travels i return again just to the time not with the time exchanged so that my self bring water for my stain never believe though in my nature reigned all frailties that besiege all kinds of blood that it could so'

generate_text_seq() generates n_words words after the given seed_text. The seed_text is encoded with the same tokenizer used for the training data, then pad_sequences() pads or truncates it to the 50 most recent words. model.predict() returns a probability for every word in the vocabulary, and np.argmax() picks the index of the most likely one (older TensorFlow releases offered model.predict_classes() for this step; it was removed in TensorFlow 2.6). The index is looked up in tokenizer.index_word to recover the word. Finally the predicted word is appended to seed_text and text, and the process repeats.

PYTHON
def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):
  text = []

  for _ in range(n_words):
    # Encode the running text and keep only the last text_seq_length tokens
    encoded = tokenizer.texts_to_sequences([seed_text])[0]
    encoded = pad_sequences([encoded], maxlen = text_seq_length, truncating='pre')

    # Greedy decoding: index of the most probable next word
    y_predict = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])

    # index_word is the inverse of word_index; index 0 (padding) has no word
    predicted_word = tokenizer.index_word.get(y_predict, '')
    seed_text = seed_text + ' ' + predicted_word
    text.append(predicted_word)
  return ' '.join(text)

The next 100 words predicted by the model for the seed_text:

PYTHON
generate_text_seq(model, tokenizer, seq_length, seed_text, 100)
OUTPUT
'preposterously be stained to leave for them when thou art covetous and we pour their natural fortune grace the other fool as a monkey i cannot weep and ends you clown nay ay my lord ham aside the queen quoth thou into sport angelo to th capitol brutus patience peace night i am returnd to th field o th gout and a man that murdred of the greatest rout of her particular occasions that outruns the world side corioli did my performance gone cut the offender and take thee to you that dare not go and show me to be'
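The windowing that pad_sequences() performs inside generate_text_seq() can be mimicked in plain Python. This is a simplified sketch of 'pre' padding and 'pre' truncating for a single sequence, not the Keras implementation; fit_window is a hypothetical helper.

```python
def fit_window(seq, maxlen, pad_value=0):
    """Keep the last maxlen items; left-pad with pad_value if the input is shorter."""
    seq = list(seq)[-maxlen:]                        # 'pre' truncation drops the oldest items
    return [pad_value] * (maxlen - len(seq)) + seq   # 'pre' padding fills on the left

too_long = fit_window([5, 6, 7, 8, 9, 10, 11], maxlen=5)
too_short = fit_window([5, 6], maxlen=5)
print(too_long)
print(too_short)
```

As seed_text grows by one word per iteration, this is what keeps the model input at exactly 50 tokens: the newest word enters on the right and the oldest falls off the left.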

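The model reaches about 46% accuracy on the training data; no validation set was held out, so this figure does not measure how well the model generalizes. To improve it, train for more epochs or use the entire dataset: this model was trained on only about a quarter of the available data.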

Conclusion

In this tutorial you trained a stacked two-layer LSTM on roughly 200,000 words from Shakespeare's complete works to generate new text word by word. After cleaning punctuation, building 51-word sliding sequences, and training for 100 epochs with categorical_crossentropy, the model reached 46% training accuracy and generated text with a recognizably Shakespearean vocabulary and rhythm, such as "patience peace night i am returnd to th field."

Key takeaways:

  • Using 50 context words (rather than 1 or 2) gives the LSTM enough temporal context to capture phrase-level patterns in long literary prose, improving coherence of generated output.
  • pad_sequences with truncating='pre' discards the oldest tokens when the seed text grows beyond seq_length, keeping the most recent context window intact for prediction.
  • 46% word-level training accuracy on a 13,009-word vocabulary is far above random chance (about 0.008%); options for increasing it further include training on the full dataset (898k tokens) and training for more epochs.
  • The generation loop always picks the most probable next word (greedy decoding); replacing that step with top-k sampling at inference time produces more varied output.
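Top-k sampling, mentioned in the last takeaway, can be sketched in a few lines of plain Python. This is one possible formulation under simple assumptions (sample among the k most probable words, in proportion to their probabilities); sample_top_k and the toy probability vector are hypothetical and were not part of the tutorial's code.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Sample an index from the k most probable entries of probs."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]       # relative weights; choices() normalizes them
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
probs = [0.05, 0.40, 0.10, 0.30, 0.15]      # toy distribution over 5 "words"
draws = [sample_top_k(probs, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

In generate_text_seq(), the probabilities would come from the model's softmax output for the current window, and the sampled index would replace the greedy choice. With k=1 the function reduces to greedy decoding.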
