Localization Lens for Improving Medical Vision-Language Models
Hasan Farooq, Murtaza Taj, Mehwish Nasim, Arif Mahmood

Abstract: Medical Vision-Language Models (Med-VLMs) have demonstrated strong capabilities in clinical tasks. However, they often struggle to understand anatomical structures and spatial positioning, which are crucial for medical reasoning. To address this, we propose a localization-aware enhancement to the Med-VLM pipeline, introducing improvements at three levels: data,…