Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure

Day - Time: 07 February 2012, 11:00
Place: Area della Ricerca CNR di Pisa - Room: C-29
Speaker: Oscar Täckström (SICS / Uppsala University, Sweden)

Fabrizio Sebastiani


The ability to predict the linguistic structure of sentences or documents is central to natural language processing. While annotated resources for parsing and several other tasks are available in a number of languages, we cannot expect to have access to labeled resources for all tasks in all languages. In this talk I will describe how cross-lingual word clusters can be used to sidestep this problem, focusing on two important tasks: syntactic dependency parsing and named-entity recognition (NER). First, I will show how monolingual word clusters can be used to improve parsing and NER for a range of languages across language families. I will then describe an algorithm for inducing cross-lingual word clusters using large corpora and word alignments, and show how these clusters can significantly improve the accuracy of cross-lingual structure prediction. Specifically, I will show how an English dependency parser and NER system can be transferred to a range of other languages without any target-language training data.