ISTI-Talk: A Heterogeneous Path in NLP: A Multi-Level Approach to Evaluation
- Day - Time: 12 November 2025, 12:00
- Place: Area della Ricerca CNR di Pisa - Room: C-29
Speakers
Referent
Abstract
As AI models grow more powerful, our methods for evaluating them must also become more sophisticated. This talk confronts this challenge by posing a series of progressively more probing questions: How can we test whether a model’s performance stems from genuine understanding rather than spurious shortcuts? How closely do a model’s learned representations align with human cognition? And how robust are the very tools we build to monitor and control these systems?
The talk addresses these questions through three projects. First, we tackle shortcut learning with ViLMA, a benchmark that uses counterfactuals to evaluate deep multimodal reasoning. Next, we probe cognitive alignment by comparing LLMs’ conceptual structures with human data, uncovering significant differences. Finally, we evaluate the robustness of safety tools by stress-testing AI-text detectors with adversarial attacks, exposing critical weaknesses.
By weaving these three threads together, the talk offers a unified perspective on the multifaceted challenges of evaluation and highlights their relevance to the broader discourse on AI.