Identifying Genres of Web Pages

Marina Santini


Abstract
In this paper, we present an inferential model for identifying the text types and genres of Web pages: text types are inferred using a modified form of Bayes’ theorem, and genres are derived using a few simple if-then rules. Because the genre system on the Web is a complex phenomenon, and Web pages are usually more unpredictable and individualized than paper documents, we propose this approach as an alternative to unsupervised and supervised techniques. The inferential model allows a classification that can accommodate genres that are not entirely standardized, and it is better able to interpret a Web page, which is typically mixed: it rarely corresponds to an ideal type and often shows a mixture of genres or no genre at all. A proper evaluation of such a model remains an open issue.
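The two-step idea in the abstract (probabilistic inference of text types, then if-then rules for genres) can be illustrated with a minimal Python sketch. It is not the paper's model: it uses plain Bayes' theorem with independent features rather than the modified form the abstract mentions, and it assumes the rules operate on the inferred text-type probabilities. The text-type names, feature names, genre labels, probabilities, and threshold are all hypothetical values chosen for illustration.

```python
from math import prod

def infer_text_types(priors, likelihoods, observed):
    """Posterior P(text type | observed features) via plain Bayes' theorem.

    Assumes features are conditionally independent given the text type;
    the paper's modified form of Bayes' theorem is NOT reproduced here.
    """
    unnormalised = {
        t: priors[t] * prod(likelihoods[t][f] for f in observed)
        for t in priors
    }
    total = sum(unnormalised.values())
    return {t: v / total for t, v in unnormalised.items()}

def derive_genres(posterior, threshold=0.6):
    """Hypothetical if-then rules mapping text-type posteriors to genres.

    Rules fire independently, so a page may receive several genres or none.
    """
    genres = []
    if posterior["narrative"] >= threshold:
        genres.append("personal_blog")
    if posterior["instructional"] >= threshold:
        genres.append("faq")
    return genres

# Illustrative numbers only, not taken from the paper.
PRIORS = {"narrative": 0.5, "instructional": 0.5}
LIKELIHOODS = {
    "narrative": {"first_person": 0.8, "imperative": 0.2},
    "instructional": {"first_person": 0.2, "imperative": 0.8},
}

posterior = infer_text_types(PRIORS, LIKELIHOODS, ["first_person"])
genres = derive_genres(posterior)
```

Because each rule is checked on its own, raising the threshold can leave a page with no genre, and lowering it can assign several. This is one simple way to mirror the abstract's point that a Web page may show a mixture of genres or no genre at all.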
Anthology ID:
2006.jeptalnrecital-long.28
Volume:
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Month:
April
Year:
2006
Address:
Leuven, Belgique
Editors:
Piet Mertens, Cédrick Fairon, Anne Dister, Patrick Watrin
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
308–317
URL:
https://aclanthology.org/2006.jeptalnrecital-long.28
Cite (ACL):
Marina Santini. 2006. Identifying Genres of Web Pages. In Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs, pages 308–317, Leuven, Belgique. ATALA.
Cite (Informal):
Identifying Genres of Web Pages (Santini, JEP/TALN/RECITAL 2006)
PDF:
https://aclanthology.org/2006.jeptalnrecital-long.28.pdf