Localization Lens for Improving Medical Vision-Language Models

Hasan Farooq, Murtaza Taj, Mehwish Nasim, Arif Mahmood

Abstract:

Medical Vision-Language Models (Med-VLMs) have demonstrated strong capabilities in clinical tasks. However, they often struggle to understand anatomical structures and spatial positioning, which are crucial for medical reasoning. To address this, we propose a localization-aware enhancement to the Med-VLM pipeline, introducing improvements at three levels: data, architecture, and alignment. First, we introduce the localization lens, a set of expert-validated representations that provide richer anatomical and positional context. Second, because these representations increase input complexity, we integrate pixel shuffle within the model architecture to filter and refine representations, enhancing spatial information processing while preserving anatomical continuity. Lastly, to effectively align the localization lens representations with textual features, we incorporate a decoupled contrastive loss (DCL) alongside the standard loss function. This ensures better feature discrimination and robustness, particularly in data-limited medical settings. Through extensive evaluations on medical visual question answering (Med-VQA) datasets, we show that our methodology improves localization-driven performance across different Med-VLM architectures. Our analysis of localization-based questions further reveals that improvements in anatomy and spatial reasoning directly enhance the overall accuracy of Med-VQA by up to 6.2%. The proposed approach is model-agnostic and can be seamlessly integrated into existing Med-VLM pipelines.
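
The two generic building blocks named in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: `pixel_shuffle` follows the standard depth-to-space definition of the operation, and `decoupled_contrastive_loss` follows the usual formulation of DCL, in which the positive pair is removed from the softmax denominator. The function names, the temperature `tau=0.1`, the symmetric image-to-text and text-to-image averaging, and the use of plain nested lists are all assumptions made for illustration. How the paper places pixel shuffle in the architecture and how it weights DCL against the standard loss are not stated on this page.

```python
import math


def pixel_shuffle(x, r):
    """Standard depth-to-space: nested list (C*r*r, H, W) -> (C, H*r, W*r)."""
    if len(x) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out, h, w = len(x) // (r * r), len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for z in range(w * r):
                # The sub-pixel offset (y % r, z % r) selects the source channel.
                src = c * r * r + (y % r) * r + (z % r)
                out[c][y][z] = x[src][y // r][z // r]
    return out


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(a - m) for a in values))


def decoupled_contrastive_loss(img, txt, tau=0.1):
    """Symmetric DCL over matched pairs; row i of img pairs with row i of txt.

    Illustrative assumption: cosine similarity scaled by temperature tau.
    """
    n = len(img)
    if n < 2 or len(txt) != n:
        raise ValueError("need at least two matched image-text pairs")
    img = [_normalize(v) for v in img]
    txt = [_normalize(v) for v in txt]
    sim = [[sum(a * b for a, b in zip(u, v)) / tau for v in txt] for u in img]
    total = 0.0
    for i in range(n):
        # Negatives only: the positive sim[i][i] is left out of the denominator.
        row_neg = [sim[i][j] for j in range(n) if j != i]  # image -> text
        col_neg = [sim[j][i] for j in range(n) if j != i]  # text -> image
        total += -2 * sim[i][i] + _logsumexp(row_neg) + _logsumexp(col_neg)
    return total / (2 * n)
```

Because the positive term is excluded from the denominator, the loss is not bounded below by zero, so negative values are expected when pairs are well aligned. Correctly matched pairs should still score lower than mismatched ones, which gives a quick sanity check for any reimplementation.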

Text Reference:

H. Farooq, M. Taj, M. Nasim, and A. Mahmood, "Localization Lens for Improving Medical Vision-Language Models," in Proc. of the Int. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2025.

Code: The code is available at GitHub.

BibTeX Reference:

@inproceedings{localizationlensMICCAI2025,
  author={H. Farooq and M. Taj and M. Nasim and A. Mahmood},
  title={Localization Lens for Improving Medical Vision-Language Models},
  booktitle={Proc. of the Int. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI)},
  year={2025},
}
