Segmenting HTML pages using visual, semantic information

Georgios Petasis, Pavlina Fragkou, Aris Theodorakos, Vangelis Karkaletsis, Constantine D. Spyropoulos


Abstract
The information explosion of the Web aggravates the problem of effective information retrieval. Linguistic approaches found in the literature perform linguistic annotation by creating metadata in the form of tokens, lemmas, or part-of-speech tags, but this process is insufficient: such linguistic metadata do not exploit the actual content of the page, which creates the need for semantic annotation based on a predefined semantic model. This paper proposes a new learning approach for performing automatic semantic annotation. The approach is a two-step procedure: the first step partitions a web page into blocks based on its visual layout, while the second performs further partitioning by examining the appearance of specific types of entities that denote the semantic category and by applying a number of simple heuristics. Preliminary experiments on a manually annotated corpus about athletics produced very promising results.
Anthology ID:
2008.wac-1.4
Volume:
Proceedings of the 4th Web as Corpus Workshop
Month:
June
Year:
2008
Address:
Marrakech, Morocco
Editors:
Stefan Evert, Adam Kilgarriff, Serge Sharoff
Venues:
WAC | WS
Publisher:
European Language Resources Association
Pages:
18–25
URL:
https://aclanthology.org/2008.wac-1.4/
Cite (ACL):
Georgios Petasis, Pavlina Fragkou, Aris Theodorakos, Vangelis Karkaletsis, and Constantine D. Spyropoulos. 2008. Segmenting HTML pages using visual, semantic information. In Proceedings of the 4th Web as Corpus Workshop, pages 18–25, Marrakech, Morocco. European Language Resources Association.
Cite (Informal):
Segmenting HTML pages using visual, semantic information (Petasis et al., WAC 2008)
PDF:
https://aclanthology.org/2008.wac-1.4.pdf