AbstractThis paper discusses the inherent difficulties in evaluating systems for theme detection. Such systems are based essentially on unsupervised clustering aiming to discover the underlying structure in a corpus of texts. As the structures are precisely unknown beforehand, it is difficult to devise a satisfactory evaluation protocol. Several problems are posed by cluster evaluation: determining the optimal number of clusters, cluster content evaluation, topology of the discovered structure. Each of these problems has been studied separately but some of the proposed metrics portray significant flaws. Moreover, no benchmark has been commonly agreed upon. Finally, it is necessary to distinguish between task-oriented and activity-oriented evaluation as the two frameworks imply different evaluation protocols. Possible solutions to the activity-oriented evaluation can be sought from the data and text mining communities.