ISTI-TALK: A journey around fine-grained vision-language models
- Day - Time: 10 June 2026, 11:00
- Place: Area della Ricerca CNR di Pisa - Room: C-29
Speakers
Referent
Abstract
Recent vision-language models (VLMs) are good at capturing global information, but they often struggle to understand fine-grained details. This is especially true when fine-grained training data is lacking and we have to rely on weakly supervised methods. In this talk, we tackle different dimensions of "fine-grained" perception, from semantic nuances (e.g., distinguishing a wooden chair from a plastic one, or finding a specific person in an image) to spatial awareness, where the model should effectively embed meaningful information from specific spatial locations in the image. We examine current limitations and propose solutions ranging from specialized benchmarks and datasets to novel architectures. These approaches aim to address the "blind spots" of current models by carefully fine-tuning existing multimodal backbones or by leveraging recent, promising self-supervised vision models.