Text Analysis with NLP: Exploring SMS Spam Detection

In this blog post, we explore SMS spam detection with NLP techniques, using the nlpia library together with scikit-learn's machine learning tools.

Introduction to SMS Spam Detection

With the proliferation of mobile devices and messaging platforms, spam messages have become increasingly prevalent. SMS spam detection aims to identify and filter out unwanted messages, thereby improving user experience and privacy. By applying NLP techniques, we can analyze the textual content of SMS messages to distinguish legitimate communications from spam.

In [ ]:
!pip install nlpia  # the companion toolkit from "Natural Language Processing in Action"
import pandas as pd
from nlpia.data.loaders import get_data

pd.options.display.width = 120  # widen the pandas display so long messages stay readable
sms = get_data('sms-spam')      # labeled SMS dataset with 'spam' (0/1) and 'text' columns

Getting Started with nlpia

Our journey begins with installing the nlpia library, a versatile toolkit for natural language processing tasks. Using its get_data function, we load a dataset of SMS messages containing both legitimate (ham) and spam messages. This dataset serves as the foundation for building and evaluating our spam detection pipeline.
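
As a quick sanity check on what we just loaded (a small sketch, not from the original notebook; it assumes the sms DataFrame above with its binary spam column):

sms.head()               # first few labeled messages
sms.spam.value_counts()  # class balance: ham (0) vs. spam (1)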

In [ ]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from nltk.tokenize.casual import casual_tokenize  # tokenizer that copes with SMS slang and emoticons
sms
Out[ ]:
      spam                                               text
0        0  Go until jurong point, crazy.. Available only ...
1        0  Ok lar... Joking wif u oni...
2        1  Free entry in 2 a wkly comp to win FA Cup fina...
3        0  U dun say so early hor... U c already then say...
4        0  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
4832     1  This is the 2nd time we have tried 2 contact u...
4833     0  Will ü b going to esplanade fr home?
4834     0  Pity, * was in mood for that. So...any other s...
4835     0  The guy did some bitching but I acted like i'd...
4836     0  Rofl. Its true to its name

4837 rows × 2 columns

In [ ]:
# Rebuild the index so each label encodes its row number, with a '!' flagging spam
index = ['sms{}{}'.format(i, '!' * j) for (i, j) in zip(range(len(sms)), sms.spam)]
sms.index = index
sms
Out[ ]:
          spam                                               text
sms0         0  Go until jurong point, crazy.. Available only ...
sms1         0  Ok lar... Joking wif u oni...
sms2!        1  Free entry in 2 a wkly comp to win FA Cup fina...
sms3         0  U dun say so early hor... U c already then say...
sms4         0  Nah I don't think he goes to usf, he lives aro...
...        ...                                                ...
sms4832!     1  This is the 2nd time we have tried 2 contact u...
sms4833      0  Will ü b going to esplanade fr home?
sms4834      0  Pity, * was in mood for that. So...any other s...
sms4835      0  The guy did some bitching but I acted like i'd...
sms4836      0  Rofl. Its true to its name

4837 rows × 2 columns

Text Preprocessing and Feature Extraction

To prepare the textual data for analysis, we tokenize the SMS messages with NLTK's casual tokenizer and apply TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. TF-IDF transforms each message into a numerical feature vector, weighting each word by how important it is to that message relative to its frequency across the whole corpus. Because PCA expects zero-mean features, we will also center the resulting matrix by subtracting each column's mean before modeling.
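
To make the weighting concrete before running it on the full corpus, here is a minimal toy sketch (the three messages and all names are invented for illustration, not part of the notebook):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    'free entry to win a prize',   # spam-flavored
    'are you free for lunch',      # ham-flavored
    'win a free prize now',        # spam-flavored
]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# 'free' appears in every message, so IDF down-weights it everywhere;
# terms unique to one message, like 'lunch', score relatively higher there.
print(sorted(toy_tfidf.vocabulary_))
print(toy_matrix.toarray().round(2))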

In [ ]:
# casual_tokenize handles SMS-style text (slang, emoticons) better than the default pattern
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()

len(tfidf.vocabulary_)  # number of unique tokens found in the corpus
/usr/local/lib/python3.9/dist-packages/sklearn/feature_extraction/text.py:528: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
Out[ ]:
9232
In [ ]:
print(tfidf_docs)  # mostly zeros: each message uses only a handful of the 9,232 terms
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
In [ ]:
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()  # center each term column: PCA assumes zero-mean features
tfidf_docs.shape
Out[ ]:
(4837, 9232)
In [ ]:
tfidf_docs
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 9222 9223 9224 9225 9226 9227 9228 9229 9230 9231
0 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
1 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
2 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 0.096125 0.127340 0.124007 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
3 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
4 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4832 0.063691 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
4833 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
4834 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
4835 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055
4836 -0.025643 -0.00584 -0.000228 -0.000053 -0.000156 -0.000943 -0.000463 -0.006695 -0.004035 -0.002745 ... -0.000264 -0.000426 -7.667659e-07 -0.001598 -0.000148 -0.000099 -0.00066 -0.000055 -0.000055 -0.000055

4837 rows × 9232 columns

In [ ]:
sms.spam.sum()  # how many of the 4,837 messages are labeled spam
Out[ ]:
638

Dimensionality Reduction with PCA

Given the high-dimensional nature of the TF-IDF feature space, we employ Principal Component Analysis (PCA) to reduce the dimensionality and extract latent topics from the SMS messages. By projecting the data onto a lower-dimensional space, PCA enables us to identify key patterns and structures within the text, facilitating more efficient modeling and interpretation.
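
In matrix terms (standard PCA/LSA background, not code from the notebook): if X is the mean-centered TF-IDF matrix with singular value decomposition X = U·S·Vᵀ, then keeping the top k = 16 right singular vectors V_k gives the topic vectors T = X·V_k = U_k·S_k. The pca.components_ attribute below holds V_kᵀ, one row per topic.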

In [ ]:
pca = PCA(n_components=16)  # keep 16 principal components ("topics")
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)
pca_topic_vectors.round(3)
Out[ ]:
topic0 topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10 topic11 topic12 topic13 topic14 topic15
sms0 0.201 0.003 0.037 0.011 -0.019 -0.053 0.039 -0.066 0.013 -0.082 0.005 -0.009 -0.019 -0.019 -0.006 0.032
sms1 0.404 -0.094 -0.078 0.051 0.100 0.047 0.023 0.065 0.023 -0.023 -0.002 0.038 -0.045 -0.016 0.046 -0.044
sms2! -0.030 -0.048 0.090 -0.067 0.091 -0.043 -0.000 -0.002 -0.057 0.048 0.122 0.022 -0.035 0.012 -0.032 0.048
sms3 0.329 -0.033 -0.035 -0.016 0.052 0.056 -0.166 -0.074 0.062 -0.105 0.021 0.031 -0.080 -0.028 0.018 -0.070
sms4 0.002 0.031 0.038 0.034 -0.075 -0.093 -0.044 0.061 -0.044 0.028 0.028 -0.014 -0.020 0.053 -0.074 -0.016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
sms4832! -0.126 -0.160 0.091 -0.074 0.194 0.103 -0.161 0.021 0.095 -0.113 -0.050 0.042 -0.134 -0.090 0.239 0.054
sms4833 0.070 0.078 0.003 -0.063 -0.031 0.046 -0.018 -0.079 -0.090 0.162 0.070 0.029 0.097 -0.018 -0.017 0.205
sms4834 0.077 0.043 -0.019 0.060 0.016 -0.009 -0.024 -0.022 -0.062 -0.071 -0.078 -0.097 0.005 0.050 0.024 -0.024
sms4835 -0.029 0.007 0.001 -0.015 -0.066 -0.101 -0.028 0.033 -0.111 0.035 0.018 0.033 -0.015 -0.027 0.031 -0.055
sms4836 -0.038 -0.078 0.016 -0.064 -0.007 0.007 0.038 -0.007 -0.054 -0.004 0.049 -0.003 -0.045 0.000 -0.053 -0.006

4837 rows × 16 columns
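
How much signal do these 16 topics retain? A quick check (a sketch, not from the original notebook; it assumes the fitted pca object above):

print(pca.explained_variance_ratio_.round(3))        # per-topic share of the variance
print(pca.explained_variance_ratio_.sum().round(3))  # total variance kept by the 16 topics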

In [ ]:
tfidf.vocabulary_  # maps each token to its column number in the TF-IDF matrix

Exploring Topic Vectors and Insights

We examine the resulting PCA topic vectors to gain insight into the underlying themes in the SMS messages. By inspecting the topic weights and their associations with vocabulary terms, we uncover patterns related to positive and negative sentiment, as well as the presence of deal-related content.

In [ ]:
# Sort the vocabulary by column number so terms line up with the matrix columns
column_nums, terms = zip(*sorted(zip(tfidf.vocabulary_.values(),
                                     tfidf.vocabulary_.keys())))
In [ ]:
# One row per topic, one column per term: each term's PCA loading ("weight") in that topic
weights = pd.DataFrame(pca.components_, columns=terms,
                       index=['topic{}'.format(i) for i in range(16)])
In [ ]:
pd.options.display.max_columns = 8
weights.head(4).round(3)
Out[ ]:
! " # #150 ... … ┾ 〨ud 鈥
topic0 -0.071 0.008 -0.001 -0.000 ... -0.002 0.001 0.001 0.001
topic1 0.063 0.008 0.000 -0.000 ... 0.003 0.001 0.001 0.001
topic2 0.071 0.027 0.000 0.001 ... 0.002 -0.001 -0.001 -0.001
topic3 -0.059 -0.032 -0.001 -0.000 ... 0.001 0.000 0.000 0.000

4 rows × 9232 columns
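
Sorting one topic's row of weights shows which terms dominate it (a sketch, not from the original notebook; it uses the weights DataFrame above, and topic4 is an arbitrary choice):

weights.loc['topic4'].sort_values(ascending=False).head(10)  # terms pushing topic4 positive
weights.loc['topic4'].sort_values().head(10)                 # terms pushing topic4 negative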

In [ ]:
pd.options.display.max_columns = 10
# Pull out the weights for a hand-picked list of "deal"-flavored tokens, scaled by 100 for readability
deals = weights['! ;) :) half off free crazy deal only $ 80 %'.split()].round(3) * 100
deals.head(5)
Out[ ]:
! ;) :) half off ... deal only $ 80 %
topic0 -7.1 0.1 -0.5 -0.0 -0.4 ... -0.1 -2.2 0.3 -0.0 -0.0
topic1 6.3 0.0 7.4 0.1 0.4 ... -0.1 -3.8 -0.1 -0.0 -0.2
topic2 7.1 0.2 -0.1 0.1 0.3 ... -0.1 0.7 0.0 0.0 0.1
topic3 -5.9 -0.3 -7.1 0.2 0.3 ... 0.1 -2.3 0.1 -0.1 -0.3
topic4 38.1 -0.1 -12.5 -0.1 -0.2 ... -0.2 3.0 0.3 0.1 -0.1

5 rows × 12 columns

In [ ]:
deals.T.sum()  # net "deal" weight of each topic, summed over the hand-picked tokens
# Topics 4, 8, 9, 11, 13, and 14 all carry positive "deal" weight, while topics 0, 3,
# 5, and 10 look like "anti-deal" topics: messages about the opposite of "deals".
Out[ ]:
topic0    -11.9
topic1      7.5
topic2     12.8
topic3    -15.5
topic4     38.3
topic5    -33.9
topic6      4.8
topic7     -5.0
topic8     40.6
topic9     32.0
topic10   -29.1
topic11    48.3
topic12     3.5
topic13    47.5
topic14    32.0
topic15    -4.2
dtype: float64

Dimension Reduction using SVD

Additionally, we explore dimension reduction using Truncated Singular Value Decomposition (SVD), another technique for reducing the dimensionality of sparse data matrices. By transforming the TF-IDF matrix with SVD, we obtain topic vectors that capture the latent semantic structure of the SMS messages, facilitating further analysis and interpretation.

In [ ]:
# TruncatedSVD works on the data matrix directly (no internal centering);
# n_iter=100 gives the randomized solver extra passes for accuracy
svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
In [ ]:
svd_topic_vectors.round(3).head(6)
Out[ ]:
topic0 topic1 topic2 topic3 topic4 ... topic11 topic12 topic13 topic14 topic15
sms0 0.201 0.003 0.037 0.011 -0.019 ... -0.007 0.002 -0.036 -0.014 0.037
sms1 0.404 -0.094 -0.078 0.051 0.100 ... 0.036 0.043 -0.021 0.051 -0.042
sms2! -0.030 -0.048 0.090 -0.067 0.091 ... 0.023 0.026 -0.020 -0.042 0.052
sms3 0.329 -0.033 -0.035 -0.016 0.052 ... 0.023 0.073 -0.046 0.022 -0.070
sms4 0.002 0.031 0.038 0.034 -0.075 ... -0.009 0.027 0.034 -0.083 -0.021
sms5! -0.016 0.059 0.014 -0.006 0.122 ... 0.055 -0.037 0.075 -0.001 0.020

6 rows × 16 columns
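
Because tfidf_docs was mean-centered earlier, TruncatedSVD is computing essentially the same decomposition PCA did; each topic's sign is arbitrary, so comparing absolute values should show the agreement (a sketch, not from the original notebook; run it before the normalization in the next cell):

import numpy as np

np.allclose(np.abs(pca_topic_vectors.values),
            np.abs(svd_topic_vectors.values), atol=1e-3)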

In [ ]:
# L2-normalize each row so that dot products between rows become cosine similarities
svd_topic_vectors = (svd_topic_vectors.T / np.linalg.norm(
    svd_topic_vectors, axis=1)).T
In [ ]:
# Pairwise cosine similarity between the first 20 messages in topic space
svd_topic_vectors.iloc[:20].dot(svd_topic_vectors.iloc[:20].T).round(1)
Out[ ]:
sms0 sms1 sms2! sms3 sms4 ... sms15! sms16 sms17 sms18 sms19!
sms0 1.0 0.6 -0.1 0.6 -0.0 ... -0.2 0.4 0.2 -0.1 -0.2
sms1 0.6 1.0 -0.2 0.8 -0.2 ... -0.2 0.3 0.4 -0.0 -0.2
sms2! -0.1 -0.2 1.0 -0.2 0.1 ... 0.3 -0.1 -0.4 -0.2 0.9
sms3 0.6 0.8 -0.2 1.0 -0.2 ... -0.3 0.2 0.7 0.1 -0.2
sms4 -0.0 -0.2 0.1 -0.2 1.0 ... -0.0 -0.1 -0.1 -0.2 0.2
sms5! -0.3 0.0 0.4 -0.3 0.2 ... 0.2 -0.2 -0.5 -0.1 0.4
sms6 -0.3 -0.2 0.0 -0.1 0.0 ... -0.2 -0.2 0.1 0.1 0.2
sms7 -0.1 -0.2 0.3 -0.3 0.1 ... 0.5 -0.1 -0.1 -0.2 0.6
sms8! -0.3 -0.1 0.5 -0.2 -0.4 ... -0.1 -0.2 -0.1 -0.2 0.4
sms9! -0.3 -0.1 0.4 -0.1 -0.2 ... 0.7 -0.2 -0.3 0.2 0.5
sms10 -0.2 -0.3 0.1 -0.3 0.3 ... 0.1 0.4 -0.2 -0.2 0.0
sms11! -0.2 -0.2 0.8 -0.2 0.3 ... 0.3 -0.1 -0.5 -0.3 0.9
sms12! -0.1 -0.1 0.6 -0.2 -0.2 ... 0.2 -0.2 -0.4 -0.1 0.5
sms13 -0.5 -0.4 -0.2 -0.5 -0.1 ... 0.1 -0.2 0.1 0.1 -0.0
sms14 -0.1 -0.1 0.0 -0.2 0.1 ... -0.1 -0.1 0.0 0.0 -0.0
sms15! -0.2 -0.2 0.3 -0.3 -0.0 ... 1.0 -0.1 -0.5 0.4 0.5
sms16 0.4 0.3 -0.1 0.2 -0.1 ... -0.1 1.0 0.1 -0.1 -0.2
sms17 0.2 0.4 -0.4 0.7 -0.1 ... -0.5 0.1 1.0 0.1 -0.4
sms18 -0.1 -0.0 -0.2 0.1 -0.2 ... 0.4 -0.1 0.1 1.0 -0.0
sms19! -0.2 -0.2 0.9 -0.2 0.2 ... 0.5 -0.2 -0.4 -0.0 1.0

20 rows × 20 columns
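
Each entry is the cosine similarity between two messages in topic space, and the spam messages visibly cluster: sms2!, sms11!, and sms19! all score 0.8-0.9 against one another. As a cross-check (a sketch, not from the original notebook), scikit-learn's cosine_similarity normalizes internally and should reproduce the same table:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(svd_topic_vectors.iloc[:20]).round(1)  # matches the dot-product table above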
