In this blog post, we embark on a journey through sentiment analysis using deep learning techniques, focusing on a dataset of IMDB movie reviews.

The Journey Begins: Data Preparation

Our journey begins with gathering and preprocessing the data. The Python code below loads the IMDB movie reviews from the training directory, collects the raw text of each review, and records its sentiment label: 0 for a negative review, 1 for a positive one. A later cell splits the data into training and validation sets.
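Before running the loading code below, it is worth confirming that the dataset is where the script expects it. The following is a minimal sanity check, assuming the ACL IMDB archive has been downloaded and unpacked into an aclImdb/ folder next to the notebook, with train/pos and train/neg subdirectories of .txt files (this check is not part of the original notebook):

import os

# Sanity check: assumes the dataset is unpacked as ./aclImdb (local path assumption).
for split in ('pos', 'neg'):
    path = os.path.join('aclImdb', 'train', split)
    print(path, '->', len(os.listdir(path)), 'review files')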

In [ ]:
import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
In [ ]:
imdb_dir = 'aclImdb'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # Read the review text and record its sentiment (0 = negative, 1 = positive).
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)
In [ ]:
maxlen = 100                 # cut each review off after 100 words
training_samples = 200       # train on only 200 reviews
validation_samples = 10000   # validate on 10,000 reviews
max_words = 10000            # keep only the 10,000 most frequent words
In [ ]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

Harnessing the Power of Tokenization and Embedding

Next, we turn to tokenization and word embedding. Tokenizing the text and converting it into sequences of integers puts the data in a form a deep learning model can consume. Word embedding, provided by the Embedding layer in Keras, then maps each word index to a point in a continuous vector space, letting the model capture semantic relationships and contextual information.
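To make the tokenization step concrete, here is a tiny illustrative example on made-up sentences (the exact integer indices depend on the fitted vocabulary, so treat the values in the comments as indicative only): texts_to_sequences replaces each known word with its integer index, and pad_sequences trims or left-pads every sequence to a fixed length.

# Toy illustration of the Tokenizer / pad_sequences API on made-up sentences.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy_texts = ["the movie was great", "the movie was terrible"]
toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(toy_texts)
toy_sequences = toy_tokenizer.texts_to_sequences(toy_texts)
print(toy_sequences)                           # e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]
print(pad_sequences(toy_sequences, maxlen=6))  # zero-padded on the left to length 6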

In [ ]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Shuffle before splitting: the reviews were loaded in order (all negative, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
In [ ]:
model = Sequential()
model.add(Embedding(max_words, 32, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 100, 32)           320000    
                                                                 
 flatten (Flatten)           (None, 3200)              0         
                                                                 
 dense (Dense)               (None, 32)                102432    
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 422,465
Trainable params: 422,465
Non-trainable params: 0
_________________________________________________________________
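The parameter counts in the summary follow directly from the layer shapes: the Embedding layer learns a 32-dimensional vector for each of the 10,000 vocabulary entries (10,000 × 32 = 320,000), Flatten reshapes the (100, 32) output into a single 3,200-dimensional vector, the first Dense layer has 3,200 × 32 weights plus 32 biases (102,432), and the output layer adds 32 weights plus one bias (33), giving 422,465 trainable parameters in total.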
In [ ]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Epoch 1/10
7/7 [==============================] - 6s 170ms/step - loss: 0.6931 - acc: 0.5000 - val_loss: 0.6926 - val_acc: 0.5166
Epoch 2/10
7/7 [==============================] - 1s 95ms/step - loss: 0.6092 - acc: 0.9850 - val_loss: 0.6920 - val_acc: 0.5191
Epoch 3/10
7/7 [==============================] - 1s 86ms/step - loss: 0.5235 - acc: 1.0000 - val_loss: 0.6928 - val_acc: 0.5165
Epoch 4/10
7/7 [==============================] - 1s 100ms/step - loss: 0.4013 - acc: 1.0000 - val_loss: 0.6946 - val_acc: 0.5201
Epoch 5/10
7/7 [==============================] - 1s 103ms/step - loss: 0.2768 - acc: 1.0000 - val_loss: 0.6978 - val_acc: 0.5197
Epoch 6/10
7/7 [==============================] - 1s 83ms/step - loss: 0.1793 - acc: 1.0000 - val_loss: 0.6977 - val_acc: 0.5224
Epoch 7/10
7/7 [==============================] - 1s 98ms/step - loss: 0.1108 - acc: 1.0000 - val_loss: 0.6998 - val_acc: 0.5260
Epoch 8/10
7/7 [==============================] - 1s 101ms/step - loss: 0.0687 - acc: 1.0000 - val_loss: 0.7053 - val_acc: 0.5290
Epoch 9/10
7/7 [==============================] - 1s 92ms/step - loss: 0.0428 - acc: 1.0000 - val_loss: 0.7066 - val_acc: 0.5301
Epoch 10/10
7/7 [==============================] - 1s 94ms/step - loss: 0.0264 - acc: 1.0000 - val_loss: 0.7130 - val_acc: 0.5290

Building the Sentiment Analysis Model

With the data prepared and the groundwork laid, we construct a deep learning model for sentiment analysis. The architecture consists of an Embedding layer, a Flatten layer, and two densely connected layers; the final layer uses a sigmoid activation to output the probability that a review is positive. Because we train on only 200 samples, the model memorizes the training set (training accuracy reaches 1.0 by epoch 3) while validation accuracy stays around 52-53%, barely better than chance: a clear sign of overfitting.
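One way to see this gap at a glance is to plot the curves stored in the History object returned by model.fit. The snippet below is a small sketch using matplotlib and is not part of the original notebook:

# Plot training vs. validation accuracy from the History object (illustrative sketch).
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo-', label='Training accuracy')
plt.plot(epochs, val_acc, 'ro-', label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()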

In [ ]:
prediction = model.predict(x_train)
print(prediction)
[[0.984327  ]
 [0.01924068]
 [0.0110822 ]
 [0.01262942]
 [0.97248036]
 [0.01588428]
 [0.9796188 ]
 [0.01627335]
 [0.99331343]
 [0.9759619 ]
 [0.01426247]
 [0.011078  ]
 [0.9853578 ]
 [0.01978219]
 [0.01340187]
 [0.98952186]
 [0.09334317]
 [0.98546576]
 [0.987548  ]
 [0.01510549]
 [0.99350953]
 [0.0184232 ]
 [0.9799543 ]
 [0.98556125]
 [0.02328515]
 [0.04164717]
 [0.01863596]
 [0.9655813 ]
 [0.9848035 ]
 [0.9886347 ]
 [0.01304111]
 [0.01048258]
 [0.02637956]
 [0.01768672]
 [0.9811418 ]
 [0.97928727]
 [0.98129344]
 [0.02160889]
 [0.01205647]
 [0.9864374 ]
 [0.97498035]
 [0.9758201 ]
 [0.01266143]
 [0.01631787]
 [0.01575083]
 [0.98028094]
 [0.01320836]
 [0.98110306]
 [0.9758854 ]
 [0.01399916]
 [0.9780923 ]
 [0.01634195]
 [0.01736605]
 [0.9789635 ]
 [0.97271097]
 [0.9748149 ]
 [0.02366084]
 [0.0201897 ]
 [0.9835024 ]
 [0.01466307]
 [0.98590803]
 [0.01941818]
 [0.01356769]
 [0.03051502]
 [0.02028313]
 [0.00762311]
 [0.9678347 ]
 [0.02212983]
 [0.01051256]
 [0.0144347 ]
 [0.01825279]
 [0.00975993]
 [0.9764074 ]
 [0.01338488]
 [0.9603822 ]
 [0.01306102]
 [0.0099377 ]
 [0.01174927]
 [0.01596433]
 [0.9799326 ]
 [0.97865736]
 [0.01460367]
 [0.98363364]
 [0.9877846 ]
 [0.01982439]
 [0.9852457 ]
 [0.9699247 ]
 [0.01035345]
 [0.96314734]
 [0.984028  ]
 [0.98670805]
 [0.01872906]
 [0.02205122]
 [0.98174816]
 [0.01516449]
 [0.98720413]
 [0.00789425]
 [0.98073363]
 [0.9890766 ]
 [0.02011073]
 [0.00833285]
 [0.9792192 ]
 [0.9824322 ]
 [0.9842001 ]
 [0.01176727]
 [0.01396188]
 [0.02634755]
 [0.9696778 ]
 [0.01192611]
 [0.01546481]
 [0.979192  ]
 [0.01560748]
 [0.984285  ]
 [0.9821718 ]
 [0.01592016]
 [0.01816052]
 [0.01553133]
 [0.02329421]
 [0.98296696]
 [0.02179718]
 [0.01944199]
 [0.01140094]
 [0.98220146]
 [0.93146443]
 [0.01563919]
 [0.98548865]
 [0.979843  ]
 [0.02142811]
 [0.98519623]
 [0.01359755]
 [0.97499335]
 [0.02284971]
 [0.98511255]
 [0.9771031 ]
 [0.984231  ]
 [0.9770901 ]
 [0.01693249]
 [0.01396808]
 [0.00913611]
 [0.02022189]
 [0.9841635 ]
 [0.01234239]
 [0.9760896 ]
 [0.9824379 ]
 [0.9791758 ]
 [0.00978518]
 [0.98478496]
 [0.9875684 ]
 [0.9809989 ]
 [0.01966113]
 [0.9819854 ]
 [0.9853251 ]
 [0.9855607 ]
 [0.0167847 ]
 [0.9800939 ]
 [0.9849909 ]
 [0.98740304]
 [0.98189616]
 [0.0109365 ]
 [0.9910811 ]
 [0.02606037]
 [0.01217309]
 [0.9854443 ]
 [0.9850347 ]
 [0.97875977]
 [0.01795807]
 [0.98023295]
 [0.01673061]
 [0.97788537]
 [0.02021855]
 [0.98494434]
 [0.9822876 ]
 [0.9879583 ]
 [0.01364279]
 [0.00830328]
 [0.01600251]
 [0.01818144]
 [0.02121943]
 [0.9741912 ]
 [0.01501963]
 [0.01438946]
 [0.00947616]
 [0.9803969 ]
 [0.01700559]
 [0.9901439 ]
 [0.9847945 ]
 [0.98321545]
 [0.04391429]
 [0.01551023]
 [0.98694754]
 [0.02409112]
 [0.9852891 ]
 [0.983088  ]
 [0.0203062 ]
 [0.978735  ]
 [0.01089823]
 [0.01818314]
 [0.9690096 ]
 [0.01068181]
 [0.01782763]]
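The sigmoid outputs above are probabilities that each review is positive. To turn them into hard labels, you can threshold at 0.5 and compare against the true labels; the snippet below is a minimal sketch of that step (not part of the original notebook):

# Threshold the sigmoid probabilities at 0.5 and compare against the true labels.
pred_labels = (model.predict(x_train) > 0.5).astype('int32').ravel()
print(pred_labels[:10], y_train[:10])
print('Training accuracy: %.3f' % np.mean(pred_labels == y_train))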