Exploring Text Preprocessing Techniques

In Natural Language Processing (NLP), text preprocessing is a crucial first step toward extracting meaningful insights from textual data. This post walks through three preprocessing techniques in Python: tokenization, n-gram generation, and stop word removal.

In [ ]:
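import re

import nltk
from nltk.util import ngrams
# NLTK's English list holds 179 stop words in the version used here (the count varies by release)
# scikit-learn's ENGLISH_STOP_WORDS holds 318
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

# Fetch the NLTK stop word corpus (only needed once per environment)
nltk.download('stopwords')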

Tokenization with Regular Expressions: After importing the required libraries, the code defines a sample sentence and splits it with a regular expression. The split returns the separators as well as the words, so a second step filters out punctuation marks, whitespace characters, and empty strings.
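To make the two steps concrete, here is a minimal sketch that shows what the split returns before and after filtering. The sample text is hypothetical and chosen only for illustration; the pattern is the same one used later in this post.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Hello, world!"

# The capturing group makes split() return the matched separators too.
separators = re.compile(r"([-\s.,;!?])+")
raw_pieces = separators.split(sample)

# Drop empty strings and any piece that is itself a separator character.
separator_chars = set("- \t\n.,;!?")
clean_tokens = [p for p in raw_pieces if p and p not in separator_chars]

print(raw_pieces)    # words mixed with separators and a trailing empty string
print(clean_tokens)  # words only
```

Using a set for the separator characters is a small design choice: it checks for an exact match rather than a substring match, which is what the filter is really after.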

N-grams Generation: Next, the code generates bigrams and trigrams from the tokens. Each n-gram is a sequence of n adjacent tokens, which preserves some of the local word order that individual tokens lose.
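The sliding-window idea can be sketched in plain Python. The helper `make_ngrams` below is a hypothetical name introduced for illustration; it is not part of NLTK, although for simple lists it should produce the same tuples as `nltk.util.ngrams` with default arguments.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # Build n copies of the list, each shifted one position further,
    # then zip them so that aligned items form one n-gram.
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))


# Hypothetical token list, used only for illustration.
words = ["to", "be", "or", "not"]
bigrams = make_ngrams(words, 2)
trigrams = make_ngrams(words, 3)

print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so four tokens give three bigrams and two trigrams.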

Stop Words Removal: The code then turns to stop words, drawing on the lists shipped with NLTK and scikit-learn. Stop words are common words such as "the", "is", and "and" that often carry little semantic meaning on their own and can be dropped in many NLP tasks, though not all of them.
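As a rough sketch of the removal step itself, the snippet below filters a token list against a stop list. Both `demo_stop_words` and `remove_stop_words` are hypothetical names, and the tiny hand-picked set stands in for the much longer lists from NLTK or scikit-learn, either of which could be passed in instead.

```python
# Tiny hand-picked stop list for illustration only; library lists are much longer.
demo_stop_words = {"the", "is", "and", "a", "of"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical token list, used only for illustration.
demo_tokens = ["The", "cat", "is", "small", "and", "quiet"]
filtered = remove_stop_words(demo_tokens, demo_stop_words)

print(filtered)
```

Lowercasing before the lookup matters because stop lists are typically stored in lowercase, so "The" would otherwise slip through.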

In [ ]:
# Tokenize with a regular expression
sentence = "Thomas Jefferson began building Monticello at the age of 26."
# The capturing group makes split() return the separators as well as the words
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
# Drop empty strings and the separator characters themselves
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
tokens
Out[ ]:
['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']
In [ ]:
# Bigrams: every pair of adjacent tokens
two_grams = list(ngrams(tokens, 2))
[" ".join(x) for x in two_grams]
Out[ ]:
['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26']
In [ ]:
# Trigrams: every run of three adjacent tokens
three_grams = list(ngrams(tokens, 3))
[" ".join(x) for x in three_grams]
Out[ ]:
['Thomas Jefferson began',
 'Jefferson began building',
 'began building Monticello',
 'building Monticello at',
 'Monticello at the',
 'at the age',
 'the age of',
 'age of 26']

Stop words in NLTK

In [ ]:
stop_words = nltk.corpus.stopwords.words('english')
In [ ]:
len(stop_words)
Out[ ]:
179
In [ ]:
stop_words[:10]
Out[ ]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
In [ ]:
# Stop words that are exactly one character long
single_char_stop_words = [x for x in stop_words if len(x) == 1]
single_char_stop_words
Out[ ]:
['i', 'a', 's', 't', 'd', 'm', 'o', 'y']
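
One plausible reading of entries such as 's', 't', and 'd' is that they match the fragments left behind when a tokenizer splits contractions at the apostrophe. That is an assumption on my part rather than something the output above states. The sketch below uses a hypothetical sentence and a simple letters-only pattern to show how such fragments arise and how the single-character list removes them.

```python
import re

# Copied from the output above.
single_char_stop_words = ['i', 'a', 's', 't', 'd', 'm', 'o', 'y']

# Hypothetical sentence, used only for illustration.
text = "it's what i'd say"

# A letters-only pattern breaks contractions at the apostrophe.
pieces = re.findall(r"[a-z]+", text)

# Filtering against the single-character entries removes the fragments.
kept = [p for p in pieces if p not in single_char_stop_words]

print(pieces)
print(kept)
```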