Arabic Multimodal Language Modeling

Date - Time: 12 May 2025, 11:00
Place: Area della Ricerca CNR di Pisa - Room: Aula Faedo
Speakers
Referent
Giovanni Puccetti

Abstract
We present ArTST, a pre-trained Arabic text and speech transformer that extends the unified-modal framework of SpeechT5, originally developed for English. ArTST jointly learns speech and text representations within a single model architecture, enabling multi-modal input and output during pre-training. This unified model can then be fine-tuned individually for a variety of downstream tasks, including speech recognition, speech synthesis, speech enhancement, speaker identification, and dialect identification.

Subsequently, we explore the research question: "Can we train a single model simultaneously for multiple cross-modal speech-text tasks?" Speech recognition and speech synthesis models are usually trained separately, each with its own objectives, datasets, and model parameters, resulting in two distinct large networks. We adopt the SpeechT5 framework for unified fine-tuning and report promising results in both high-resource (English) and low-resource (Arabic) settings.
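
Because ArTST follows the SpeechT5 architecture, a fine-tuned checkpoint can in principle be run through the standard SpeechT5 classes in Hugging Face transformers. The sketch below illustrates speech-recognition inference only; it uses the public English checkpoint microsoft/speecht5_asr as a stand-in (an exported ArTST ASR checkpoint would load the same way), and the one-second silent waveform is a placeholder for real 16 kHz audio.

```python
# Minimal ASR inference sketch with a SpeechT5-style model.
# Assumptions: the checkpoint name is a stand-in, not the ArTST weights,
# and the input waveform is dummy silence rather than real speech.
import numpy as np
import torch
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

CHECKPOINT = "microsoft/speecht5_asr"  # placeholder; ArTST shares this architecture

processor = SpeechT5Processor.from_pretrained(CHECKPOINT)
model = SpeechT5ForSpeechToText.from_pretrained(CHECKPOINT)

# One second of silence at 16 kHz; replace with a real mono waveform.
waveform = np.zeros(16000, dtype=np.float32)
inputs = processor(audio=waveform, sampling_rate=16000, return_tensors="pt")

# Autoregressive decoding of the transcript token IDs.
with torch.no_grad():
    predicted_ids = model.generate(**inputs, max_length=100)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Each downstream task (synthesis, enhancement, speaker or dialect identification) would swap in the corresponding task head, e.g. SpeechT5ForTextToSpeech for synthesis, while reusing the same pre-trained encoder.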