HYBRID LEXICAL-SEMANTIC RETRIEVAL FOR IMPROVED ACADEMIC LITERATURE SEARCH

Authors

  • Fawad Khan
  • Saddam Hussain Khan
  • Hamad Khan
  • Tahir Hussain

Abstract

The rapid growth of scientific publications has made accurate and comprehensive literature search a critical challenge for researchers. Traditional keyword-based search engines often miss relevant papers that use different terminology, while dense embedding-based retrieval can overlook exact matches for domain-specific terms. To address both limitations, this paper proposes a hybrid retrieval approach that combines lexical BM25 matching with dense semantic embeddings through a weighted fusion score, with the aim of improving both recall and ranking quality in academic document search. Experiments are conducted on a curated dataset of 100 computer science papers from the arXiv repository, and retrieval performance is evaluated using Recall@5, Recall@10, and nDCG@10 against BM25-only and dense-only baselines. The hybrid approach achieves a Recall@10 of 0.85, outperforming the BM25-only (0.72) and dense-only (0.74) baselines, and the highest nDCG@10 of 0.83, indicating better ranking quality. On this dataset, these findings indicate that combining lexical and semantic signals improves literature search effectiveness without requiring complex multi-agent systems or citation verification. The proposed hybrid retrieval is lightweight, easy to implement, and suitable for integration into academic search engines and digital libraries.
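
The weighted fusion and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion formula, the normalization step, or the weight, so the min-max rescaling, the parameter `alpha`, the binary relevance labels, and all function names below are assumptions made for illustration rather than the authors' implementation. The BM25 and dense similarity scores are taken as precomputed inputs.

```python
import math


def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * lexical + (1 - alpha) * semantic.

    alpha is a hypothetical fusion weight; a document missing from one
    score list contributes 0 for that signal.
    """
    lex, sem = min_max(bm25_scores), min_max(dense_scores)
    fused = {
        doc: alpha * lex.get(doc, 0.0) + (1 - alpha) * sem.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranking, relevant, k):
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranking[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    # Toy scores for one query; the values are made up for the example.
    bm25 = {"a": 10.0, "b": 2.0, "c": 0.0}
    dense = {"a": 0.1, "b": 0.9, "d": 0.5}
    ranking = hybrid_rank(bm25, dense, alpha=0.5)
    print(ranking)
    print(recall_at_k(ranking, {"a", "d"}, k=2))
    print(ndcg_at_k(ranking, {"a", "d"}, k=3))
```

Setting `alpha=1.0` reduces this sketch to a BM25-only ranking and `alpha=0.0` to a dense-only ranking, which mirrors the two baselines the abstract compares against.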

Published

2026-05-14

How to Cite

Fawad Khan, Saddam Hussain Khan, Hamad Khan, & Tahir Hussain. (2026). HYBRID LEXICAL-SEMANTIC RETRIEVAL FOR IMPROVED ACADEMIC LITERATURE SEARCH. Spectrum of Engineering Sciences, 4(5), 1192–1203. Retrieved from https://thesesjournal.com/index.php/1/article/view/2820