Localization Lens for Improving Medical Vision-Language Models

By cvlabadmin, September 13, 2025

Hasan Farooq, Murtaza Taj, Mehwish Nasim, Arif Mahmood

Abstract:

Medical Vision-Language Models (Med-VLMs) have demonstrated strong capabilities in clinical tasks. However, they often struggle to understand anatomical structures and spatial positioning, which are crucial for medical reasoning. To address this, we propose a localization-aware enhancement to the Med-VLM pipeline, introducing improvements at three levels: data, architecture, and alignment. First, we introduce the localization lens, a set of expert-validated representations that provide richer anatomical and positional context. Because these representations increase input complexity, we integrate pixel shuffle within the model architecture to filter and refine them, enhancing spatial information processing while preserving anatomical continuity. Lastly, to effectively align the localization lens representations with textual features, we incorporate a decoupled contrastive loss (DCL) alongside the standard loss function. This ensures better feature discrimination and robustness, particularly in data-limited medical settings. Through extensive evaluations on medical visual question answering (Med-VQA) datasets, we show that our methodology improves localization-driven performance across different Med-VLM architectures. Our analysis of localization-based questions further reveals that improvements in anatomy and spatial reasoning directly enhance the overall accuracy of Med-VQA by up to 6.2%. The proposed approach is model-agnostic and can be seamlessly integrated into existing Med-VLM pipelines.
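The alignment step can be illustrated with a minimal sketch of a decoupled contrastive loss next to the usual InfoNCE loss. This is the generic DCL idea (the positive pair is dropped from the denominator), written in plain Python for the image-to-text direction only; it is not taken from the authors' code. The function names, the temperature value, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _l2_normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def _logsumexp(values):
    # Numerically stable log(sum(exp(x))) over a non-empty list.
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))


def _logits(image_feats, text_feats, temperature):
    # Cosine similarity between every image and every text, scaled by 1/temperature.
    imgs = [_l2_normalize(v) for v in image_feats]
    txts = [_l2_normalize(v) for v in text_feats]
    return [[_dot(i, t) / temperature for t in txts] for i in imgs]


def info_nce_loss(image_feats, text_feats, temperature=0.1):
    """Standard contrastive loss: the positive pair also appears in the denominator."""
    logits = _logits(image_feats, text_feats, temperature)
    n = len(logits)
    return sum(-logits[i][i] + _logsumexp(logits[i]) for i in range(n)) / n


def decoupled_contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Hypothetical DCL sketch: the denominator sums over negatives only (j != i)."""
    logits = _logits(image_feats, text_feats, temperature)
    n = len(logits)
    if n < 2:
        raise ValueError("DCL needs at least one negative per sample (batch size >= 2)")
    total = 0.0
    for i in range(n):
        negatives = [logits[i][j] for j in range(n) if j != i]
        total += -logits[i][i] + _logsumexp(negatives)
    return total / n


if __name__ == "__main__":
    # Toy batch: row k of image_feats is paired with row k of text_feats.
    image_feats = [[1.0, 0.2], [0.1, 1.0], [0.7, 0.7]]
    text_feats = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.8]]
    print("InfoNCE:", info_nce_loss(image_feats, text_feats))
    print("DCL:    ", decoupled_contrastive_loss(image_feats, text_feats))
```

Because the positive similarity no longer appears in the denominator, the gradient on the positive pair is not damped when that pair is already well aligned. This is the usual argument for why a decoupled loss can behave better with small batches, which is in keeping with the data-limited setting the abstract mentions.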

Text Reference:

H. Farooq, M. Taj, M. Nasim, and A. Mahmood, "Localization Lens for Improving Medical Vision-Language Models," in Proc. of the Int. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2025.

Code: The code is available on GitHub.

Bibtex Reference:

@inproceedings{localizationlensMICCAI2025,
  author={H. Farooq and M. Taj and M. Nasim and A. Mahmood},
  title={Localization Lens for Improving Medical Vision-Language Models},
  booktitle={Proc. of the Int. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI)},
  year={2025},
}
