Keras LSTM Example | Sequence Binary Classification

A sequence is an ordered series of values, where each value corresponds to an observation at a specific point in time. Sequence prediction involves using historical sequential data to predict the next value or values. The machine learning models that deal most successfully with sequential data are RNNs (Recurrent Neural Networks).

Sequential problems are widely seen in Natural Language Processing. If you think about it, a sentence is a sequence of words in which each word represents a value at time t. Most of us read from left to right: the first, second, third, and subsequent words in a sentence are values that you process sequentially to understand what is being said.

In the following post, you will learn how to use Keras to build a sequence binary classification model using LSTMs (a type of RNN model) and word embeddings. We will be classifying sentences as positive or negative.

Get the Data

We will be approaching this problem without shortcuts. Our only help will be in preparing a dataset to apply our model to. We will be using the Large Movie Review Dataset which you can obtain from here. With the data on hand, we will be performing a series of pre-processing steps in order to convert from text to a data format our LSTM will understand.

This dataset, provided by Stanford, was used in the paper Learning Word Vectors for Sentiment Analysis. It is a widely cited paper in the NLP world and can be used to benchmark your models.

Once you download and extract the file, you will have train and test folders.

Inside each of the train and test folders there are two subfolders, pos and neg, which contain positive and negative movie reviews. Our goal is to learn from these labeled sentences and be able to correctly classify a review with a positive or negative label.

Data Preparation

To keep things simple, we will use an in memory solution for handling this dataset. Each folder (pos and neg) contains multiple text files where each file has a single review. We need to first combine all reviews from multiple files into a single dataset we will be keeping in memory. 


To combine all reviews into a single dataset, we will first implement two functions. GetTextFilePathsInDirectory provides us with the full paths of all .txt files in the provided folder, utilizing the os.listdir function. The second function, GetLinesFromTextFile, accepts a file path as input and returns its contents decoded as utf-8.

import os

def GetTextFilePathsInDirectory(directory):
    files = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            filePath = os.path.join(directory, file)
            files.append(filePath)
    return files

def GetLinesFromTextFile(filePath):
    with open(filePath,"r", encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return lines

Now, use the above 2 functions to obtain the positive and negative reviews into 2 lists. Below, we first get all file names from the train/pos and train/neg folders. Then, we obtain the first 500 positive and negative reviews into the reviews_positive and reviews_negative list.

positive_files = GetTextFilePathsInDirectory("aclImdb/train/pos/")
negative_files = GetTextFilePathsInDirectory("aclImdb/train/neg/")

reviews_positive = []
for i in range(0, 500):
    reviews_positive.extend(GetLinesFromTextFile(positive_files[i]))

reviews_negative = []
for i in range(0, 500):
    reviews_negative.extend(GetLinesFromTextFile(negative_files[i]))

 Let’s see what these reviews look like.

print("Positive Review---> {0}".format(reviews_positive[5]))
print("Negative Review---> {0}".format(reviews_negative[5]))

 A sampled positive review:

This isn't the comedic Robin Williams, nor is it the quirky/insane Robin Williams of recent thriller fame. This is a hybrid of the classic drama without over-dramatization, mixed with Robin's new love of the thriller. But this isn't a thriller, per se. This is more a mystery/suspense vehicle through which Williams attempts to locate a sick boy and his keeper.

Also starring Sandra Oh and Rory Culkin, this Suspense Drama plays pretty much like a news report, until William's character gets close to achieving his goal.

I must say that I was highly entertained, though this movie fails to teach, guide, inspect, or amuse. It felt more like I was watching a guy (Williams), as he was actually performing the actions, from a third person perspective. In other words, it felt real, and I was able to subscribe to the premise of the story.

All in all, it's worth a watch, though it's definitely not Friday/Saturday night fare.

It rates a 7.7/10 from...

the Fiend :.

A sampled negative review:

"It appears that many critics find the idea of a Woody Allen drama unpalatable." And for good reason: they are unbearably wooden and pretentious imitations of Bergman. And let's not kid ourselves: critics were mostly supportive of Allen's Bergman pretensions, Allen's whining accusations to the contrary notwithstanding. What I don't get is this: why was Allen generally applauded for his originality in imitating Bergman, but the contemporaneous Brian DePalma was excoriated for "ripping off" Hitchcock in his suspense/horror films? In Robin Wood's view, it's a strange form of cultural snobbery. I would have to agree with that.

With our 500 positive and 500 negative reviews, which we will use to train our LSTM machine learning model, we can now continue with the pre-processing phase. Notice that some reviews contain html code, and others contain characters which don't provide value to our model; we need to clean those up.

In this article we will apply very basic pre-processing logic to our text. I need to emphasize that this is a very important step: you can either lose information or add noise to your data if done incorrectly.

Pre-Processing Movie Reviews


We will use a modified version of a clean_review function created by Aaron on github found here.

The clean_review function replaces html markup in the reviews with a space, and removes punctuation characters without adding a space. In this function, we also use the natural language python toolkit (NLTK) to remove stop words from the reviews. Stop words are words such as "a" that appear with high frequency in sentences without providing much value.

import nltk
from nltk.corpus import stopwords
import re

REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    default_stop_words = nltk.corpus.stopwords.words('english')
    stopwords = set(default_stop_words)
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    reviews = [RemoveStopWords(line, stopwords) for line in reviews]
    return reviews

def RemoveStopWords(line, stopwords):
    words = []
    for word in line.split(" "):
        word = word.strip()
        if word not in stopwords and word != "" and word != "&":
            words.append(word)
    return " ".join(words)

Use the preprocess_reviews function to clean our reviews as below.

reviews_positive = preprocess_reviews(reviews_positive)
reviews_negative = preprocess_reviews(reviews_negative)

Now, our positive and negative reviews have been cleaned: unwanted characters and stopwords are removed, and the text is converted to lower case.
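As a quick sanity check, here is a minimal standalone sketch of the cleaning logic applied to a made-up review string (the sample text is hypothetical; the regexes mirror the ones used above):

```python
import re

# Same patterns as in preprocess_reviews above
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

sample = "Great movie!<br /><br />A must-see."
cleaned = REPLACE_NO_SPACE.sub("", sample.lower())  # strip punctuation without a space
cleaned = REPLACE_WITH_SPACE.sub(" ", cleaned)      # html breaks and dashes become spaces
print(cleaned)  # great movie a must see
```

Stop-word removal is omitted here so the example stays self-contained, but it would run as a third step exactly as in preprocess_reviews.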

Label our Input Sentences

With our positive and negative reviews preprocessed, we will now be adding a label which we will train our binary classifier to predict. We will use 1 for a positive review and 0 for a negative review. 

To do so, we will use numpy to generate a vector of ones and a vector of zeros with a length equal to the length of our reviews_positive and reviews_negative.

Lastly we use the python zip function to combine our reviews with our labels. Since zip returns an iterator, we then convert this iterator to a list. Do so using the below code.

import numpy as np
Reviews_Labeled = list(zip(reviews_positive, np.ones(len(reviews_positive))))
Reviews_Labeled.extend(list(zip(reviews_negative, np.zeros(len(reviews_negative)))))

Bag of Words Model

Before we can feed our data to our LSTM model, we need to convert words to numbers that our model can understand. For this, we will be using a bag of words model. You can read more about bag of words here.
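The core idea can be sketched in a few lines: count how often each word appears in the corpus, then keep only the most frequent words as the vocabulary. This toy example (the sentences are made up) mirrors what the Vocabulary class below will do:

```python
from collections import Counter

corpus = ["good movie", "bad movie", "good plot"]  # toy corpus
counter = Counter()
for sentence in corpus:
    counter.update(sentence.split(" "))

# The two most common words form a tiny vocabulary
vocabulary = [word for word, _count in counter.most_common(2)]
print(vocabulary)  # ['good', 'movie'] -- both appear twice
```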

To handle this effectively, I will provide you with a class to help us with this task. The class is called Vocabulary, and it helps with the common tasks involved in preparing text in a numeric form for machine learning.

This class will generate our bag of words model and provide us with methods to convert between text to integers and vice-versa. The class is the following: 

from collections import Counter
import json
import numpy as np

class Vocabulary:
    def __init__(self, vocabulary, wordFrequencyFilePath):
        self.vocabulary = vocabulary
        self.WORD_FREQUENCY_FILE_FULL_PATH = wordFrequencyFilePath
        self.input_word_index = {}
        self.reverse_input_word_index = {}
        self.MaxSentenceLength = None

    def PrepareVocabulary(self, reviews):
        # Build the word frequency file, then the word <-> id indexes
        self._prepare_Word_Frequency_Count_File(reviews)
        self._create_Vocab_Indexes()
        self.MaxSentenceLength = max([len(txt.split(" ")) for txt in reviews])

    def Get_Top_Words(self, number_words=None):
        if number_words is None:
            number_words = self.vocabulary
        words = json.loads(open(self.WORD_FREQUENCY_FILE_FULL_PATH).read())
        counter = Counter(words)
        most_popular_words = [key for key, _value in counter.most_common(number_words)]
        return most_popular_words

    def _prepare_Word_Frequency_Count_File(self, reviews):
        counter = Counter()
        for s in reviews:
            counter.update(s.split(" "))
        with open(self.WORD_FREQUENCY_FILE_FULL_PATH, 'w') as output_file:
            output_file.write(json.dumps(counter))

    def _create_Vocab_Indexes(self):
        INPUT_WORDS = self.Get_Top_Words(self.vocabulary)
        for i, word in enumerate(INPUT_WORDS):
            self.input_word_index[word] = i
        for word, i in self.input_word_index.items():
            self.reverse_input_word_index[i] = word

    def TransformSentencesToId(self, sentences):
        # Convert each sentence into a vector of word ids;
        # out-of-vocabulary words are left as 0
        vectors = []
        for r in sentences:
            words = r.split(" ")
            vector = np.zeros(len(words))
            for t, word in enumerate(words):
                if word in self.input_word_index:
                    vector[t] = self.input_word_index[word]
            vectors.append(vector)
        return vectors

    def ReverseTransformSentencesToId(self, id_vectors):
        # Convert vectors of word ids back into sentences
        sentences = []
        for vector in id_vectors:
            words = [self.reverse_input_word_index[int(t)]
                     for t in vector if int(t) in self.reverse_input_word_index]
            sentences.append(" ".join(words))
        return sentences

Now, let's instantiate our vocabulary. The Vocabulary class constructor takes 2 arguments. The first is an integer, vocabulary, that determines how many words your vocabulary will be composed of: the class goes through the internally built bag of words model and chooses the most common words up to your vocabulary length. In this article, we will use a vocabulary of the 500 most common words. The second argument is the full path of the file where the vocabulary (the bag of words) will be stored.

Instantiate our vocabulary as below, using the 500 most common words. Then, run the PrepareVocabulary method and provide it with a list of reviews. Because we had previously added a label, we use a list comprehension to obtain only the review text from our Reviews_Labeled object.

TOP_WORDS = 500
vocab = Vocabulary(TOP_WORDS, "analysis.vocab")

reviews_text = [line[0] for line in Reviews_Labeled]
vocab.PrepareVocabulary(reviews_text)

Integer Encode Words

Next, we use our Vocabulary class to convert our sentences from words to integers. In this step we convert each word in our reviews into an integer using the TransformSentencesToId function of our Vocabulary class. Do so as below.

labels = [line[1] for line in Reviews_Labeled]

reviews_int = vocab.TransformSentencesToId(reviews_text)
Reviews_Labeled_Int = list(zip(reviews_int, labels))

The Reviews_Labeled_Int list now holds sentences in which each word has been replaced by a number representing that word.
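To illustrate the transformation, here is a sketch with a small hypothetical word index (the words and ids are made up; the real index is built by the Vocabulary class):

```python
# Hypothetical word -> id index, for illustration only
input_word_index = {"good": 4, "movie": 7, "plot": 12}

sentence = "good movie weak plot"
# Known words map to their id; out-of-vocabulary words stay 0
vector = [input_word_index.get(word, 0) for word in sentence.split(" ")]
print(vector)  # [4, 7, 0, 12] -- 'weak' is out of vocabulary
```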


Split Train and Test

We then split Reviews_Labeled_Int into training and test datasets using the commonly used sklearn function train_test_split, with 20% for testing and 80% for training.

from sklearn.model_selection import train_test_split
train, test = train_test_split(Reviews_Labeled_Int, test_size=0.2)

Lastly, unzip our train and test data into our X and Y vectors. X holds the inputs and Y the labels that we are trying to predict.

X_train, y_train = list(zip(*train))
X_test, y_test = list(zip(*test))

y_train = np.array(y_train)
y_test = np.array(y_test)

Pad Sentences with Keras

All our X vectors need to be of the same length for our RNN model to work. Because some sentences are longer than others, we will use a function provided by Keras to pad the sentences with leading zeros in order to make them the same length.

For this article, we will use a length of 500 words, defined in our max_review_length variable. Any sentence with more than 500 words will be truncated; any sentence with fewer than 500 words will be padded with leading zeros until the vector is of length 500.
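What this padding does can be sketched in plain Python (a toy example using a maximum length of 5 instead of 500):

```python
def pad_with_leading_zeros(sequence_ids, max_length):
    """Truncate to max_length or left-pad with zeros,
    mirroring Keras' default 'pre' padding and truncating."""
    if len(sequence_ids) >= max_length:
        # truncating='pre' keeps the last max_length elements
        return sequence_ids[-max_length:]
    return [0] * (max_length - len(sequence_ids)) + sequence_ids

print(pad_with_leading_zeros([3, 8, 5], 5))           # [0, 0, 3, 8, 5]
print(pad_with_leading_zeros([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```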

Keras provides us with a pad_sequences function to make this easy. Run the below code to pad our X_train and X_test vectors. 

from keras.preprocessing import sequence 
max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length) 

Keras LSTM model with Word Embeddings

Most of our code so far has been for pre-processing our data. The modeling side of things is made easy thanks to Keras and the many researchers behind RNN models.

To create our LSTM model with a word embedding layer, we create a sequential Keras model. Add an embedding layer with a vocabulary length of 500 (we defined this previously as TOP_WORDS). Our embedding vector length will be 32, and our input_length will equal the X vector length, defined and padded to 500 words.

The next layer is a simple LSTM layer of 100 units. Because our task is binary classification, the last layer will be a dense layer with a sigmoid activation function.
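The sigmoid squashes any real-valued score into the (0, 1) range, which we can read as the probability of the positive class. A quick numpy illustration (the scores are made up), using 0.5 as the decision threshold:

```python
import numpy as np

def sigmoid(x):
    # Classic logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-2.0, 0.0, 2.0])  # hypothetical raw model outputs
probs = sigmoid(scores)
print(np.round(probs, 2))            # approx [0.12 0.5  0.88]

# Threshold at 0.5 to obtain the binary labels
labels = (probs >= 0.5).astype(int)
print(labels)                        # [0 1 1]
```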

The loss function we use is binary_crossentropy with an adam optimizer. We tell Keras to report an accuracy metric. At the end, we print a summary of our model.

from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
import keras

embedding_vector_length = 32
model = Sequential()
model.add(Embedding(TOP_WORDS, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Lastly, let's train our machine learning RNN model for 10 epochs with a batch size of 64.

, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)



After training, this simple model reaches an accuracy of nearly 80%. There is still much more that can be done to improve it: you can increase the vocabulary, add more training samples, add regularization, improve the pre-processing stage, and so on. You are now armed with the knowledge of how to use Keras to build an LSTM model that can perform binary classification on sequential data such as sentences. Stay tuned for more!