The KITAAB Framework: Advancing Arabic Script OCR for Frontispieces and Bibliographic Records

Day - Time: 19 March 2026, h.11:00
Place: Area della Ricerca CNR di Pisa - Room: C-29
Speakers
  • Amina El Ganadi (Università degli Studi di Modena e Reggio Emilia)
Referent
  • Andrea Pedrotti

Abstract
The digitization of Arabic cultural heritage poses significant challenges because of the visual and structural complexity of Arabic scripts. Historical frontispieces, often combining elaborate ornamentation with highly stylized calligraphy, feature script traditions such as Naskh, Ruq‘ah, Thuluth, Sini, and Kufi. Their distinctive ligatures, diacritics, and context-sensitive letterforms make accurate text recognition difficult for conventional Optical Character Recognition (OCR) systems, which often struggle with mixed styles, decorative layouts, and inconsistent orthographic features. In particular, elongated ligatures, variable diacritic placement, and the blending of multiple scripts on a single page reduce OCR performance in ornate historical materials.

This talk introduces KITAAB (Kraken-Integrated Technology for Advancing Arabic Bibliographies), an ongoing project designed to improve OCR for Arabic-script sources. Named after the Arabic word for “book,” KITAAB focuses on the computational analysis of Arabic texts and bibliographic materials. The project fine-tunes a pre-trained Kraken OCR model within eScriptorium, an open-source environment for transcription and annotation, using a curated dataset of 100 Arabic book frontispieces selected to reflect a wide range of calligraphic styles. By building on Kraken’s deep-learning framework and segmentation-free approach, KITAAB aims to strengthen recognition of complete words, ligatures, and context-dependent letterforms across varied script forms, while laying the groundwork for future corpus expansion.

The presentation will also address the main challenges encountered in this work, including complex page design, rare or highly decorative script variants, and inconsistent diacritic usage. These issues are being tackled through iterative model refinement and targeted preprocessing strategies, with particular emphasis on improving diacritic recognition, a semantically essential feature of Arabic that remains underrepresented in standard OCR workflows.