Utility-Theoretic Ranking for Semi-Automated Text Classification

Day - Time: 20 May 2013, h.11:00
Place: Area della Ricerca CNR di Pisa - Room: C-29

Andrea Esuli


Suppose an organization needs to classify a set D of textual documents, and suppose that D is too large to be classified manually, so that resorting to some form of automated text classification (TC) is the only viable option. Suppose also that the organization has strict accuracy standards, so that the level of effectiveness obtainable via state-of-the-art TC technology is not sufficient. In this case, the most plausible strategy is to classify D by means of an automatic classifier F, and then to have a human editor inspect the results of the automatic classification, correcting misclassifications where appropriate. The human editor will inspect only a subset D' of D, since inspecting all of D would make the initial automated classification phase pointless. We call this scenario Semi-Automated Text Classification (SATC).

An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, with respect to the confidence-based baseline above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
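
To make the contrast between the two ranking strategies concrete, here is a minimal sketch in Python. It is not the method presented in the seminar, whose exact formulation the abstract does not give. It assumes that effectiveness is measured by F1, that estimates of the contingency cells (tp, fp, fn) are available, and that each document comes with an estimated probability that F's label is wrong. All names (Doc, p_error, rank_by_expected_gain, and so on) are hypothetical.

```python
from dataclasses import dataclass


def f1(tp: float, fp: float, fn: float) -> float:
    """F1 computed from contingency cells; defined as 1.0 when all cells are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the automatic classifier F
    p_error: float            # assumed estimate of P(F's label is wrong)


def rank_by_low_confidence(docs):
    """Baseline: documents most likely to be misclassified come first."""
    return sorted(docs, key=lambda d: d.p_error, reverse=True)


def rank_by_expected_gain(docs, tp, fp, fn):
    """Rank by P(error) times the F1 gain obtained if the error is corrected.

    Correcting a false positive turns it into a true negative (fp - 1).
    Correcting a false negative turns it into a true positive (tp + 1, fn - 1).
    """
    base = f1(tp, fp, fn)
    gain_fp = f1(tp, max(fp - 1, 0), fn) - base
    gain_fn = f1(tp + 1, fp, max(fn - 1, 0)) - base

    def expected_gain(d):
        gain = gain_fp if d.predicted_positive else gain_fn
        return d.p_error * gain

    return sorted(docs, key=expected_gain, reverse=True)


# Illustrative data only: two documents and made-up contingency estimates.
docs = [
    Doc("a", predicted_positive=True, p_error=0.45),
    Doc("b", predicted_positive=False, p_error=0.35),
]
by_confidence = [d.doc_id for d in rank_by_low_confidence(docs)]
by_gain = [d.doc_id for d in rank_by_expected_gain(docs, tp=50, fp=10, fn=30)]
print(by_confidence, by_gain)
```

With these illustrative numbers, correcting a false negative raises F1 more than correcting a false positive, so document "b" is ranked above "a" even though F is less confident about "a". This is the kind of case in which ranking by confidence alone can be suboptimal.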

NOTE: This seminar is the third in a series of six seminars presented by the winners of the prize "Young researchers ISTI 2013". Giacomo Berardi placed first in the PhD student category.