X-HASHNET: A VISION TRANSFORMER-BASED DEEP HASHING FRAMEWORK FOR EFFICIENT SEMANTIC IMAGE RETRIEVAL

Muhammad Irfan; Javiriya Hameed Arain; Kinza Fatima

Abstract

X-HashNet is a transformer-based supervised hashing framework for highly efficient fashion image retrieval. Built on the DeiT-Small Vision Transformer, the model replaces convolutional architectures with a global self-attention representation paradigm. Evaluated on the Fashion-MNIST dataset, which consists of 70,000 grayscale images from 10 categories of apparel, X-HashNet generates compact 64-bit binary embeddings from the raw inputs, optimized for retrieval based on Hamming distance. The pipeline combines five key stages: ViT-compatible patch embedding, transformer-based feature extraction, supervised bottleneck hashing, multi-objective optimization, and FAISS-based binarization and indexing. The model achieves a mean Average Precision (mAP@100) of 0.9348, which sets a new state-of-the-art benchmark for hashing on Fashion-MNIST. Diagnostic analyses confirm near-optimal code utilization: the average bit activation is 0.4907, the inter-class and intra-class Hamming distances are 2.29 and 0.28 bits respectively, and hash stability is 84.20%. The bit redundancy score of 0.2775 and the near-ideal entropy distribution indicate efficient encoding of information across all hash dimensions. Empirical results also validate strong generalization, yielding a Precision@1 of 93.39% and stable performance at deeper retrieval depths (P@5 to P@100 ≈ 93%). From a systems perspective, the average query time of 0.1738 milliseconds and throughput of more than 5,754 queries per second make X-HashNet suitable for large-scale deployment, with only 8 bytes per image needed for indexing. Visual attention maps validate the model's ability to both localize and preserve important structural features (e.g., silhouettes and textures of clothing).
Collectively, these results provide evidence that transformer-based hashing not only outperforms CNN counterparts in retrieval accuracy, but also provides a scalable, industrially viable foundation for real-time search and recommendation systems for fashion.
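
To make the storage and search claims concrete, the following is a minimal sketch of how 64-bit codes can be packed into 8 bytes per image and searched by Hamming distance. It is not the authors' implementation. It assumes sign thresholding at zero for binarization, uses a brute-force NumPy search as a stand-in for the FAISS index, and substitutes random features for the hash-layer outputs. The function names `binarize`, `hamming_search`, and `mean_bit_activation` are hypothetical.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Threshold real-valued outputs at 0 (an assumption) and pack bits into bytes.

    embeddings: (n, 64) float array -> (n, 8) uint8 array, i.e. 8 bytes per image.
    """
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_search(query: np.ndarray, database: np.ndarray, k: int):
    """Brute-force k-nearest-neighbour search under Hamming distance."""
    xor = np.bitwise_xor(database, query)              # differing bits, per byte
    dists = POPCOUNT[xor].sum(axis=1, dtype=np.int64)  # popcount per code
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


def mean_bit_activation(codes: np.ndarray) -> float:
    """Fraction of bits set to 1 across all codes (0.5 means balanced bits)."""
    return float(np.unpackbits(codes, axis=1).mean())


# Random stand-in for hash-layer outputs; a real pipeline would use model features.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 64))
codes = binarize(features)

idx, dist = hamming_search(codes[0], codes, k=5)
print(codes.shape, codes[0].nbytes)  # storage per image in bytes
print(idx, dist)                     # nearest neighbours of the first code
print(mean_bit_activation(codes))    # bit balance diagnostic
```

An average bit activation close to 0.5, such as the reported 0.4907, means each bit is set for roughly half of the images, which is the condition under which a bit carries the most information. The reported throughput is also consistent with the reported latency, since 1 / 0.1738 ms is approximately 5,754 queries per second.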

Keywords

Deep Supervised Hashing, Vision Transformers (ViT), DeiT (Data-efficient Image Transformers), Fashion Image Retrieval, Binary Hash Codes, Hamming Distance Search, Fashion-MNIST, Multi-Objective Optimization, Self-Attention Mechanisms, FAISS Indexing.