In the diverse landscape of natural language processing (NLP), non-English text presents unique challenges and opportunities. In this blog post, we explore Urdu text processing, focusing on the extraction of common words using Python and regular expressions.
**Introduction to Urdu Text Processing**

Urdu, a rich and expressive language spoken by millions worldwide, offers a wealth of textual data for analysis and understanding. However, processing Urdu text poses challenges due to its right-to-left Perso-Arabic script and its linguistic characteristics. By leveraging Python and NLP techniques, we can unlock the potential of Urdu text processing and extract valuable insights from Urdu-language content.
**Getting Started with Python and Regular Expressions**

Our journey begins with Python and its built-in `re` module for regular expressions, a powerful tool for pattern matching and text manipulation. We define a sample Urdu passage, which serves as our dataset for analysis.
# Sample Urdu passage; roughly: "Since the poor public has always been
# accustomed to being deceived, it fell for the smooth talk of the 'change
# government' and, for its better future, delivered the new government to
# the corridors of power."
string_test = """بے چاری عوام چونکہ ہمیشہ سے دھوکہ کھانے کی عادی رہی ہے اس لئے ‘‘تبدیلی سرکار’’ کی چکنی چپڑی باتوں میں آگئی اور اپنے بہتر مستقبل
کے لئے نئی حکومت کو اقتدار کے ایوانوں تک پہنچا دیا"""
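Before building the tokenizer, it is worth checking that regular expressions handle Urdu at all. In Python 3 the `re` module is Unicode-aware by default, so `\w` matches Urdu letters just as it matches ASCII ones. A minimal sketch using the `string_test` passage defined above:

```python
import re

# \w+ is Unicode-aware in Python 3, so it matches runs of Urdu letters
tokens = re.findall(r'\w+', string_test)
print(tokens[:3])   # ['بے', 'چاری', 'عوام']

# Restricting \w to ASCII shows why this matters: the Urdu passage
# contains no ASCII word characters, so nothing matches.
print(re.findall(r'\w+', string_test, flags=re.ASCII))  # []
```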
**Tokenization and Word Segmentation**

To analyze the Urdu text, we implement a custom tokenization function using regular expressions. This function segments the text into individual words, preserving the linguistic structure and ensuring accurate analysis of the textual content.
**Extracting Common Words**

Next, we identify common words in the Urdu text by checking each token against a predefined list of frequently occurring function words. The text is then segmented at these common words (once a segment has accumulated more than eight words), so we extract meaningful, contextually coherent chunks of content rather than isolated tokens, enhancing our understanding of the underlying themes and topics.
import re

# word_tokenize(string_test)  # NLTK alternative; we use our own regex tokenizer
def regex_token(string):
    # \w+ is Unicode-aware in Python 3, so it captures Urdu words as well
    words = re.findall(r'\w+', string)
    return words
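# Example: regex_token(string_test) returns the individual Urdu words,
# e.g. ['بے', 'چاری', 'عوام', ...]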
def read_from_file(filename):
    # encoding='utf-8' matters for Urdu text: without it, platforms whose
    # default encoding is not UTF-8 will fail to decode the file
    with open(filename, 'r', encoding='utf-8') as file:
        file_content = file.readlines()
    sum_of_words = ''.join(file_content)
    return sum_of_words
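# Note: main() below reads 'sample.txt'; presumably it holds the sample
# passage, so it can be recreated if needed with:
#   with open('sample.txt', 'w', encoding='utf-8') as f:
#       f.write(string_test)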
def finding_common_words(string, common_word_list):
    list_of_words_after_adding = []  # completed segments
    words_list = []                  # words in the current segment
    final_str = ''
    for i in string:
        words_list.append(i + " ")
        # close the segment once a common word appears and more than
        # eight words have accumulated
        if i in common_word_list and len(words_list) > 8:
            list_of_words_after_adding.append(''.join(words_list))
            words_list = []  # reset so segments do not overlap
    for j in list_of_words_after_adding:
        final_str = final_str + j + '-'
    return "-" + final_str
def main():
    file_content = read_from_file('sample.txt')
    segmented_words = regex_token(file_content)
    # a small list of high-frequency Urdu function words (stop words)
    list_of_words = ["ہونا", "ہونگے", "ہونی", "ہوں", "ہی", "ہیں", "ہے", "یہ", "یہاں", "یہی", "یہیں", "تھا", "تھی", "تھیں", "تھے", "نہیں", "گا", "گئی", "گیا", "دیا"]
    text = finding_common_words(segmented_words, list_of_words)
    print(text)
    # to save the result instead, remember UTF-8 here as well:
    # with open('output.txt', 'w', encoding='utf-8') as f:
    #     f.writelines(text)

main()
Running the script on the sample file prints the text split into two segments, each ending at a common word:

-بے چاری عوام چونکہ ہمیشہ سے دھوکہ کھانے کی عادی رہی ہے -اس لئے تبدیلی سرکار کی چکنی چپڑی باتوں میں آگئی اور اپنے بہتر مستقبل کے لئے نئی حکومت کو اقتدار کے ایوانوں تک پہنچا دیا -