Marina Santini


Visualizing Facets of Text Complexity across Registers
Marina Santini | Arne Jönsson | Evelina Rennes
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

In this paper, we propose visualizing results of a corpus-based study on text complexity using radar charts. We argue that the added value of this type of visualization is the polygonal shape that provides an intuitive grasp of text complexity similarities across the registers of a corpus. The results that we visualize come from a study where we explored whether it is possible to automatically single out different facets of text complexity across the registers of a Swedish corpus. To this end, we used factor analysis as applied in Biber’s Multi-Dimensional Analysis framework. The visualization of text complexity facets with radar charts indicates that there is a correspondence between linguistic similarity and similarity of shape across registers.


Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language
Marina Santini | Benjamin Danielsson | Arne Jönsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We explore the effectiveness of four feature representations – bag-of-words, word embeddings, principal components and autoencoders – for the binary categorization of the easy-to-read variety vs standard language. Standard language refers to the ordinary language variety used by a population as a whole or by a community, while the “easy-to-read” variety is a simpler (or a simplified) version of the standard language. We test the efficiency of these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic. We rely on supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is equivalent or equal to the performance achieved with principal components and autoencoders, whose preprocessing is, however, more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.


Book Review: Discourse on the Move: Using Corpus Analysis to Describe Discourse Structure by Douglas Biber, Ulla Connor, and Thomas A. Upton
Marina Santini
Computational Linguistics, Volume 35, Number 1, March 2009


Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
Georg Rehm | Marina Santini | Alexander Mehler | Pavel Braslavski | Rüdiger Gleim | Andrea Stubbe | Svetlana Symonenko | Mirko Tavosanis | Vedrana Vidulin
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.


Identifying Genres of Web Pages
Marina Santini
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this paper, we present an inferential model for text type and genre identification of Web pages, where text types are inferred using a modified form of Bayes’ theorem, and genres are derived using a few simple if-then rules. As the genre system on the Web is a complex phenomenon, and Web pages are usually more unpredictable and individualized than paper documents, we propose this approach as an alternative to unsupervised and supervised techniques. The inferential model allows a classification that can accommodate genres that are not entirely standardized, and is more capable of reading a Web page, which is mixed, rarely corresponding to an ideal type and often showing a mixture of genres or no genre at all. A proper evaluation of such a model remains an open issue.

Interpreting Genre Evolution on the Web
Marina Santini
Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources

Implementing a Characterization of Genre for Automatic Genre Identification of Web Pages
Marina Santini | Richard Power | Roger Evans
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions


Clustering Web Pages to Identify Emerging Textual Patterns
Marina Santini
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (articles courts)

The Web has triggered many adjustments in many fields. It has also had a strong impact on the genre repertoire. Novel genres have already emerged, e.g. blogs and FAQs. Presumably, other new genres are still in formation, because the Web is still fluid and in constant change. In this paper we present an experiment that explores the possibility of automatically detecting the emerging textual patterns that are slowly taking shape on the Web. Emerging textual patterns can develop into novel Web genres or novel text types in the near future. The experimental setup includes a collection of unclassified web pages, two sets of features and the use of cluster analysis. Results are encouraging and deserve further investigation.