A FRAMEWORK FOR HATE SPEECH IDENTIFICATION USING OPTIMIZED TEXT FEATURES AND NATURAL LANGUAGE PROCESSING ON TWITTER DATASET

Irsa Manzoor; Muhammad Sajid Maqbool; Faisal Shahzad; Muqadas Nadeem; Amna Zulfiqar; Syeda Qanitah Naqvi

Keywords:

Hate Speech Recognition, Sentiment Analysis, Tweets Prediction, Machine Learning

Abstract

Twitter has emerged as a prominent social media platform where users rapidly share opinions, emotions, experiences, and real-time events. As the volume of user-generated text grows, sentiment analysis and hate speech detection have become important research areas in Natural Language Processing (NLP) and Machine Learning (ML). Although considerable research has been conducted on hate speech detection using Twitter data, the automatic identification of multilingual hate speech, particularly in Roman Urdu and English, remains a challenging task. This research proposes a hybrid NLP-based framework for multilingual sentiment analysis using a combined dataset of Roman Urdu and English tweets collected from publicly available hate speech datasets. The datasets are integrated into a unified corpus and processed using several NLP preprocessing techniques, including stop-word removal, punctuation removal, URL elimination, tokenization, and stemming. Optimized textual features are then extracted using Python-based NLP libraries to improve the quality of the dataset for machine learning applications. To reduce dimensionality while retaining the most informative variation in the features, Principal Component Analysis (PCA) is applied. The experimental implementation is carried out in Google Colab, where four machine learning classifiers, Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT), are trained and evaluated. In addition, a Hybrid Ensemble Model (HEM) is proposed, which combines the predictions of all four classifiers to improve classification performance. The proposed system classifies users’ sentiments into three categories: Positive, Negative, and Neutral. The models are evaluated using training accuracy, testing accuracy, precision, recall, and F1-score.
A comparative analysis of all models is conducted to identify the most effective approach for multilingual sentiment analysis and hate speech detection on Roman Urdu and English Twitter datasets.
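
The preprocessing steps and the ensemble idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The stop-word list, the suffix-stripping `crude_stem` helper, and the function names are hypothetical stand-ins, and the abstract does not state how the HEM combines its four classifiers, so simple majority voting is assumed here.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller
# English list together with a Roman Urdu list.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "to", "of", "hai", "ka", "ki", "ko"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Apply URL elimination, punctuation removal, tokenization,
    stop-word removal, and stemming, in that order."""
    text = URL_PATTERN.sub(" ", tweet.lower())  # URLs go first, before ':' and '/' are stripped
    text = text.translate(PUNCT_TABLE)          # drop punctuation, including '@' and '#'
    tokens = text.split()                       # whitespace tokenization
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def majority_vote(predictions):
    """Combine one label per classifier by majority vote (assumed rule).
    Ties are resolved in favour of the earliest-listed classifier."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    print(preprocess("The haters are posting lies! https://t.co/abc"))
    # One hypothetical label each from NB, RF, SVM, and DT for a single tweet.
    print(majority_vote(["Negative", "Negative", "Neutral", "Positive"]))
```

In a full pipeline, the token lists would be turned into numeric features and reduced with PCA before training the four classifiers; that part is omitted here to keep the sketch free of third-party dependencies.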