In this blog post, we embark on a journey through sentiment analysis using deep learning techniques, focusing on a dataset of IMDB movie reviews.
The Journey Begins: Data Preparation

Our journey commences with gathering and preprocessing the data. The Python snippet below loads the IMDB movie reviews from the aclImdb directory and labels each one as negative (0) or positive (1); the data is then tokenized and later split into training and validation sets.
import os
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

imdb_dir = 'aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

# Read every .txt review and record its label: 0 for 'neg', 1 for 'pos'
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)

maxlen = 100               # truncate (or pad) each review to 100 tokens
training_samples = 200     # train on only 200 reviews
validation_samples = 10000 # validate on 10,000 reviews
max_words = 10000          # keep only the 10,000 most frequent words

# Map each word to an integer index and encode every review as a sequence of indices
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
Harnessing the Power of Tokenization and Embedding

Next, we turn to tokenization and word embeddings. Tokenizing the text and converting it into sequences of integers prepares the data for a deep learning model, and the Embedding layer in Keras then represents each word as a point in a continuous vector space, where semantic and contextual relationships can be learned. A small worked example of what the tokenizer and padding produce appears after the code below.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Pad (or truncate) every review to maxlen tokens and turn the labels into an array
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffle, since the reviews were read in order (all negative, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

# First 200 reviews for training, the next 10,000 for validation
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
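Before building the model, it helps to see what the tokenizer and pad_sequences actually produce. The sketch below runs them on a single made-up review (it assumes the tokenizer fitted above; the sentence itself is hypothetical):

# Illustrative only: encode one made-up review with the tokenizer fitted above
sample = ["this movie was great fun to watch"]
sample_seq = tokenizer.texts_to_sequences(sample)      # words -> integer indices
sample_pad = pad_sequences(sample_seq, maxlen=maxlen)  # left-padded with zeros to length 100
print(sample_seq)
print(sample_pad.shape)  # (1, 100)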
model = Sequential()
model.add(Embedding(max_words, 32, input_length=maxlen))  # learn a 32-dimensional vector for each of the 10,000 words
model.add(Flatten())                                      # (100, 32) -> a single vector of 3,200 features
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))                 # probability that the review is positive
model.summary()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 100, 32) 320000 flatten (Flatten) (None, 3200) 0 dense (Dense) (None, 32) 102432 dense_1 (Dense) (None, 1) 33 ================================================================= Total params: 422,465 Trainable params: 422,465 Non-trainable params: 0 _________________________________________________________________
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Epoch 1/10
7/7 [==============================] - 6s 170ms/step - loss: 0.6931 - acc: 0.5000 - val_loss: 0.6926 - val_acc: 0.5166
Epoch 2/10
7/7 [==============================] - 1s 95ms/step - loss: 0.6092 - acc: 0.9850 - val_loss: 0.6920 - val_acc: 0.5191
Epoch 3/10
7/7 [==============================] - 1s 86ms/step - loss: 0.5235 - acc: 1.0000 - val_loss: 0.6928 - val_acc: 0.5165
Epoch 4/10
7/7 [==============================] - 1s 100ms/step - loss: 0.4013 - acc: 1.0000 - val_loss: 0.6946 - val_acc: 0.5201
Epoch 5/10
7/7 [==============================] - 1s 103ms/step - loss: 0.2768 - acc: 1.0000 - val_loss: 0.6978 - val_acc: 0.5197
Epoch 6/10
7/7 [==============================] - 1s 83ms/step - loss: 0.1793 - acc: 1.0000 - val_loss: 0.6977 - val_acc: 0.5224
Epoch 7/10
7/7 [==============================] - 1s 98ms/step - loss: 0.1108 - acc: 1.0000 - val_loss: 0.6998 - val_acc: 0.5260
Epoch 8/10
7/7 [==============================] - 1s 101ms/step - loss: 0.0687 - acc: 1.0000 - val_loss: 0.7053 - val_acc: 0.5290
Epoch 9/10
7/7 [==============================] - 1s 92ms/step - loss: 0.0428 - acc: 1.0000 - val_loss: 0.7066 - val_acc: 0.5301
Epoch 10/10
7/7 [==============================] - 1s 94ms/step - loss: 0.0264 - acc: 1.0000 - val_loss: 0.7130 - val_acc: 0.5290
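The log already tells the story: training accuracy reaches 1.0000 by epoch 3, while validation accuracy never climbs past roughly 53%, so the model is badly overfitting its 200 training samples. A quick way to visualize this is to plot the history object returned by fit (a minimal sketch, assuming matplotlib is installed and that the metric keys match the names passed to compile):

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()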
Building the Sentiment Analysis Model

With the data prepared and the groundwork laid, we construct a deep learning model for sentiment analysis. The architecture consists of an Embedding layer, followed by a Flatten layer and two densely connected layers with ReLU and sigmoid activations. The sigmoid in the final layer outputs the probability that a review is positive.
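The parameter counts reported by model.summary() follow directly from this architecture. The short check below recomputes them from the layer sizes used in the code above (a sanity check, not part of the original pipeline):

embedding_params = max_words * 32              # 10,000 words x 32 dimensions = 320,000
flattened_features = maxlen * 32               # 100 positions x 32 dimensions = 3,200
dense_params = flattened_features * 32 + 32    # weights + biases = 102,432
output_params = 32 * 1 + 1                     # 33
print(embedding_params + dense_params + output_params)  # 422,465, matching the summary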
# Predicted probabilities of the positive class for the 200 training reviews
prediction = model.predict(x_train)
print(prediction)
[[0.984327 ]
 [0.01924068]
 [0.0110822 ]
 [0.01262942]
 [0.97248036]
 [0.01588428]
 [0.9796188 ]
 [0.01627335]
 [0.99331343]
 [0.9759619 ]
 ...
 [0.9690096 ]
 [0.01068181]
 [0.01782763]]

(200 probabilities in total, one per training review; the remaining rows are omitted here for brevity. Because the model has fit the training set almost perfectly, nearly every value is pushed close to 0 or 1.)
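To close the loop, here is a sketch of how one might score an unseen review with the trained model and the fitted tokenizer. The review text and variable names are made up for illustration:

# Hypothetical example: classify a new, unseen review
new_review = ["The plot was dull and the acting was even worse."]
seq = tokenizer.texts_to_sequences(new_review)
padded = pad_sequences(seq, maxlen=maxlen)
prob = model.predict(padded)[0][0]   # probability of the positive class
print('positive' if prob >= 0.5 else 'negative', prob)

Given the validation accuracy above, such predictions should be treated with caution until the model is trained on far more than 200 reviews.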