CROSS-DATASET GENERALIZATION OF MACHINE LEARNING MODELS FOR PHISHING URL DETECTION
Keywords:
Phishing detection, Machine learning, URL features, Cybersecurity
Abstract
Phishing attacks remain a serious and persistent challenge in cybersecurity, causing substantial financial losses and the compromise of sensitive user information. As phishing techniques grow more sophisticated, machine learning methods have become a widely adopted solution for phishing URL detection because they can automatically learn distinguishing patterns from data. However, a notable weakness in existing research is the frequent reliance on single-dataset evaluation, in which a model is trained and tested on splits of the same dataset; this does not accurately reflect real-world operating conditions. This study addresses that limitation by examining the cross-dataset generalization capability of machine learning models for phishing URL detection. Two publicly available phishing URL datasets containing both lexical and structural URL features are used. Three supervised learning models, namely Logistic Regression, Support Vector Machine, and Random Forest, are trained on one dataset and evaluated on an independent dataset. Experimental results show that the Random Forest classifier consistently outperforms the other models, achieving high detection accuracy while maintaining balanced precision and recall across both evaluation settings. These findings indicate that cross-dataset evaluation provides a more realistic and reliable assessment of model robustness. Overall, the study highlights the importance of moving beyond single-dataset testing and offers practical insights for developing more dependable and deployable phishing detection systems.
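The evaluation protocol summarized above (train on one dataset, test on an independent one) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the specific feature set, and the toy `MeanThresholdModel` are all hypothetical, and the sketch assumes that "both evaluation settings" means training on each dataset in turn and testing on the other. The toy model only stands in for the Logistic Regression, Support Vector Machine, and Random Forest classifiers named in the abstract; any object exposing `fit(X, y)` and `predict(X)` could be substituted.

```python
from urllib.parse import urlparse


def lexical_features(url):
    """Turn a URL into a fixed-length numeric vector (illustrative feature set)."""
    # urlparse only recognizes the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return [
        len(url),                                   # total URL length
        host.count("."),                            # dots in the hostname
        host.count("-"),                            # hyphens in the hostname
        sum(ch.isdigit() for ch in url),            # digit count
        int("@" in url),                            # '@' symbol present
        int(parsed.scheme == "https"),              # HTTPS scheme used
        len([seg for seg in parsed.path.split("/") if seg]),  # path depth
    ]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


class MeanThresholdModel:
    """Toy stand-in classifier (hypothetical): thresholds a single feature
    at the midpoint between the two class means seen during training."""

    def __init__(self, feature_index=0):
        self.feature_index = feature_index

    def fit(self, X, y):
        i = self.feature_index
        pos = [row[i] for row, label in zip(X, y) if label == 1]
        neg = [row[i] for row, label in zip(X, y) if label == 0]
        pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        self.threshold = (pos_mean + neg_mean) / 2
        self.sign = 1 if pos_mean >= neg_mean else -1
        return self

    def predict(self, X):
        i = self.feature_index
        return [int((row[i] - self.threshold) * self.sign > 0) for row in X]


def cross_dataset_eval(make_model, dataset_a, dataset_b):
    """Train on one dataset and test on the other, in both directions.

    Each dataset is an (X, y) pair; make_model returns a fresh, unfitted model
    so that nothing learned in one direction leaks into the other.
    """
    settings = {"A->B": (dataset_a, dataset_b), "B->A": (dataset_b, dataset_a)}
    results = {}
    for name, ((X_train, y_train), (X_test, y_test)) in settings.items():
        model = make_model()
        model.fit(X_train, y_train)
        results[name] = binary_metrics(y_test, model.predict(X_test))
    return results
```

A call such as `cross_dataset_eval(MeanThresholdModel, (X_a, y_a), (X_b, y_b))` returns one metrics dictionary per direction. The point of the design is that the test set never shares a source with the training set, so the reported numbers reflect a shift in data distribution rather than a held-out split of the same collection.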













