Distributed Word Embeddings on Data Streams

Day - Time: 03 February 2015, 10:30
Place: Area della Ricerca CNR di Pisa - Room: A-27

Andrea Esuli


Representing words as numerical vectors (word embeddings) is a central topic in natural language processing. Recent approaches for computing word embeddings from text corpora (e.g., word2vec) have gained popularity for their efficiency in handling huge data sets and for the quality of the resulting word representations. The idea of representing items according to the contexts in which they appear can be extended to scenarios beyond natural language. In such applications the data can be very different from text, as can the shape and the number of the items to represent. In this work we develop a word embedding application with two goals in mind: (i) we want to learn the embeddings from a data stream, so we have to deal with the time dimension and the possibly unbounded size of the data; (ii) we want to scale the whole process by distributing it across multiple machines. We present the architecture and some preliminary results of a word2vec implementation that satisfies these constraints. The results are promising, both in terms of efficacy and for the future development of the application.
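
To make goal (i) concrete, the following is a minimal sketch of how skip-gram embeddings with negative sampling might be updated one item at a time from a stream. It is not the implementation presented in the talk. The class name StreamingSkipGram, its parameters, and the uniform choice of negative samples are assumptions made for illustration (word2vec itself draws negatives from a smoothed frequency distribution). Distribution across machines, goal (ii), is not covered.

```python
import math
import random
from collections import deque


class StreamingSkipGram:
    """Toy skip-gram with negative sampling, trained one token at a time.

    Hypothetical illustration only: names and defaults are assumptions.
    """

    def __init__(self, dim=16, window=2, negatives=3, lr=0.05, seed=0):
        self.dim, self.window = dim, window
        self.negatives, self.lr = negatives, lr
        self.rng = random.Random(seed)
        self.index = {}   # token -> row id; grows as the stream reveals new items
        self.w_in = []    # item vectors (the embeddings)
        self.w_out = []   # context vectors
        # Only the last `window` item ids are kept, so the context memory
        # stays bounded however long the stream is.
        self.recent = deque(maxlen=window)

    def _row(self, token):
        # Allocate vectors lazily: the vocabulary is not known in advance.
        if token not in self.index:
            self.index[token] = len(self.w_in)
            self.w_in.append(
                [(self.rng.random() - 0.5) / self.dim for _ in range(self.dim)]
            )
            self.w_out.append([0.0] * self.dim)
        return self.index[token]

    def _update(self, center, context, label):
        # One logistic-regression gradient step on the pair (center, context).
        v, u = self.w_in[center], self.w_out[context]
        score = sum(a * b for a, b in zip(v, u))
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        g = self.lr * (label - 1.0 / (1.0 + math.exp(-score)))
        for k in range(self.dim):
            vk = v[k]
            v[k] += g * u[k]
            u[k] += g * vk

    def consume(self, token):
        """Update the model with one new token from the stream."""
        cur = self._row(token)
        for prev in self.recent:
            # Train both directions, since each item is in the other's context.
            for center, context in ((cur, prev), (prev, cur)):
                self._update(center, context, 1.0)
                for _ in range(self.negatives):
                    # Assumption: negatives drawn uniformly from items seen so far.
                    neg = self.rng.randrange(len(self.w_in))
                    if neg != context:
                        self._update(center, neg, 0.0)
        self.recent.append(cur)

    def vector(self, token):
        return self.w_in[self.index[token]]


# Usage: feed tokens as they arrive; no second pass over the data is needed.
model = StreamingSkipGram()
for tok in "a b c a b c a b c".split():
    model.consume(tok)
```

In this sketch the vocabulary and both weight tables grow with the number of distinct items seen, so a truly unbounded stream would also need some policy for discarding rare or stale items; that choice is left open here.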