Exploring Text Preprocessing Techniques

In the vast landscape of Natural Language Processing (NLP), text preprocessing is a crucial first step toward extracting meaningful insights from textual data. In this blog post, we walk through several text preprocessing techniques in Python, focusing on tokenization, n-gram generation, and stop word removal.
import re
import nltk
from nltk.util import ngrams
# NLTK's English stop word list has 179 entries (varies by version);
# scikit-learn's ENGLISH_STOP_WORDS has 318.
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS as sklearn_stop_words,
)
nltk.download('stopwords')
Tokenization with Regular Expressions: The code begins by importing the necessary libraries and defining a sentence. Using regular expressions, the sentence is split into tokens while filtering out punctuation marks and whitespace characters.
N-grams Generation: Next, the code generates bi-grams and tri-grams from the tokenized text. This process involves creating sequences of adjacent words to capture contextual information.
Stop Words Removal: The code then proceeds to remove stop words using both NLTK and Scikit-learn libraries. Stop words are common words like "the," "is," and "and" that often carry little semantic meaning and can be safely disregarded in many NLP tasks.
# tokens with re
sentence = ("Thomas Jefferson began building Monticello "
            "at the age of 26.")
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
tokens
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']
two_grams = list(ngrams(tokens, 2))
[" ".join(x) for x in two_grams]
['Thomas Jefferson', 'Jefferson began', 'began building', 'building Monticello', 'Monticello at', 'at the', 'the age', 'age of', 'of 26']
three_grams = list(ngrams(tokens, 3))
[" ".join(x) for x in three_grams]
['Thomas Jefferson began', 'Jefferson began building', 'began building Monticello', 'building Monticello at', 'Monticello at the', 'at the age', 'the age of', 'age of 26']
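As a side note, the same n-grams can be produced with plain Python, without NLTK, by zipping shifted views of the token list. This is a minimal sketch; `my_ngrams` is a hypothetical helper name, not part of any library:

```python
# Build n-grams by zipping n shifted copies of the token list;
# zip stops at the shortest copy, yielding len(seq) - n + 1 grams.
def my_ngrams(seq, n):
    return list(zip(*(seq[i:] for i in range(n))))

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
          'at', 'the', 'age', 'of', '26']
print([" ".join(g) for g in my_ngrams(tokens, 2)][:3])
# ['Thomas Jefferson', 'Jefferson began', 'began building']
```

NLTK's `ngrams` is lazier and more general, but this one-liner is handy when you want to avoid the dependency.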
# stop words in NLTK
stop_words = nltk.corpus.stopwords.words('english')
len(stop_words)
179
stop_words[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
# single-character stop words (len == 1)
one_char_words = [x for x in stop_words if len(x) == 1]
one_char_words
['i', 'a', 's', 't', 'd', 'm', 'o', 'y']
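To actually remove stop words from the tokenized sentence, you can filter the token list against a stop word set. The sketch below uses a small inline set for illustration; in practice you would pass the union of the NLTK and scikit-learn lists, and `remove_stop_words` is a hypothetical helper name:

```python
# Drop any token whose lowercase form appears in the stop word set.
def remove_stop_words(tokens, stop_set):
    return [t for t in tokens if t.lower() not in stop_set]

demo_stop_set = {'at', 'the', 'of'}  # stand-in for the full combined list
tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
          'at', 'the', 'age', 'of', '26']
print(remove_stop_words(tokens, demo_stop_set))
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'age', '26']
```

With both libraries loaded, the combined set is simply `set(stop_words) | set(sklearn_stop_words)`; lowercasing before the membership test matters because both lists store stop words in lowercase.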