Proceedings of the 4th Web as Corpus Workshop

Stefan Evert, Adam Kilgarriff, Serge Sharoff (Editors)


Anthology ID: 2008.wac-1
Month: June
Year: 2008
Address: Marrakech, Morocco
Venues: WAC | WS
Publisher: European Language Resources Association
URL: https://aclanthology.org/2008.wac-1/
PDF: https://aclanthology.org/2008.wac-1.pdf


Reranking Google with GReG
Rodolfo Delmonte | Marco Aldo Piccolino Boniforti

We present an experiment evaluating the contribution of a system called GReG for reranking the snippets returned by Google’s search engine among the 10 best links presented to the user, captured through Google’s API. The evaluation aims at establishing whether the introduction of deep linguistic information can improve the accuracy of Google, or whether the opposite is the case, as maintained by the majority of people working in Information Retrieval with a Bag of Words approach. We used 900 questions and answers taken from the TREC 8 and 9 competitions and executed three different types of evaluation: one without any linguistic aid; a second one with the contribution of tagging and syntactic constituency; and a third run with what we call Partial Logical Form. Even though GReG is still work in progress, it is possible to draw clear-cut conclusions: adding linguistic information to the process of evaluating the best snippet that can answer a question improves performance enormously. In another experiment we used the actual texts associated with the Q/A pairs distributed by one of TREC’s participants and obtained even higher accuracy.
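As a rough illustration of the contrast the abstract draws, the sketch below reranks candidate snippets against a question by plain bag-of-words overlap versus a toy "linguistically weighted" overlap. It is not GReG's actual algorithm: the head-word weighting is only a stand-in for the tagging, constituency and Partial Logical Form information the paper describes, and all names are hypothetical.

```python
# Illustrative sketch only: contrasts a plain bag-of-words score with a toy
# "linguistically weighted" score when reranking snippets against a question.
# This is NOT GReG's actual algorithm; the head-word weighting merely stands
# in for the tagging, constituency and Partial Logical Form information it uses.

def bow_score(question_tokens, snippet_tokens):
    """Plain bag-of-words overlap between question and snippet."""
    return len(set(question_tokens) & set(snippet_tokens))

def linguistic_score(question_tokens, snippet_tokens, head_words):
    """Overlap score that rewards matches on (hypothetical) head words."""
    overlap = set(question_tokens) & set(snippet_tokens)
    return len(overlap) + 2 * len(overlap & set(head_words))

def rerank(question, snippets, head_words, use_linguistics=True):
    """Return the snippets sorted best-first under the chosen scoring scheme."""
    q = question.split()
    if use_linguistics:
        score = lambda s: linguistic_score(q, s.split(), head_words)
    else:
        score = lambda s: bow_score(q, s.split())
    return sorted(snippets, key=score, reverse=True)

snippets = ["france exports wine and cheese",
            "the capital of france is paris"]
print(rerank("what is the capital of france", snippets, head_words={"capital"}))
```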

Google for the Linguist on a Budget
András Kornai | Péter Halácsy

In this paper, we present GLB, yet another open-source, free system to create and exploit linguistic corpora gathered from the web. A simple, robust web crawl algorithm, a multi-dimensional information retrieval tool, and a crude parallelization mechanism are proposed, especially for researchers working in resource-limited environments.

Victor: the Web-Page Cleaning Tool
Miroslav Spousta | Michal Marek | Pavel Pecina

In this paper we present a complete solution for the automatic cleaning of arbitrary HTML pages, with the goal of using web data as a corpus in the area of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in the analyzed web page is assigned a set of features extracted from its textual content and from the HTML structure of the page. The blocks are automatically labeled either as content segments containing the main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. The evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain.
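A minimal sketch of this kind of sequence-labelling setup is shown below, using the third-party sklearn-crfsuite package rather than the authors' Victor tool; the block features are simplified stand-ins for the textual and HTML features described in the abstract, and the toy page and labels are invented.

```python
# Minimal sketch of CRF-based block labelling in the spirit of Victor,
# using the third-party sklearn-crfsuite package (not the authors' tool).
# The features below are simplified stand-ins for the textual/HTML
# features described in the paper.
import sklearn_crfsuite

def block_features(block):
    """Features for one text block: length, link density, markup cues."""
    text = block["text"]
    return {
        "n_tokens": len(text.split()),
        "link_density": block["n_link_chars"] / max(len(text), 1),
        "tag": block["tag"],                 # enclosing HTML element
        "ends_with_period": text.rstrip().endswith("."),
    }

def page_to_features(blocks):
    return [block_features(b) for b in blocks]

# One toy page: a navigation block (noise) followed by a paragraph (content).
page = [
    {"text": "Home | About | Contact", "n_link_chars": 20, "tag": "div"},
    {"text": "The new corpus contains two billion tokens.", "n_link_chars": 0, "tag": "p"},
]
labels = ["noise", "content"]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([page_to_features(page)], [labels])          # train on the toy page
print(crf.predict([page_to_features(page)]))         # predicted label per block
```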

Segmenting HTML pages using visual and semantic information
Georgios Petasis | Pavlina Fragkou | Aris Theodorakos | Vangelis Karkaletsis | Constantine D. Spyropoulos

The information explosion of the Web aggravates the problem of effective information retrieval. Linguistic approaches found in the literature perform linguistic annotation by creating metadata in the form of tokens, lemmas or part-of-speech tags; however, this process is insufficient. This is due to the fact that such linguistic metadata do not exploit the actual content of the page, leading to the need for semantic annotation based on a predefined semantic model. This paper proposes a new learning approach for performing automatic semantic annotation. It is the result of a two-step procedure: the first step partitions a web page into blocks based on its visual layout, while the second performs subsequent partitioning based on the appearance of specific types of entities denoting the semantic category, as well as the application of a number of simple heuristics. Preliminary experiments performed on a manually annotated corpus on athletics proved to be very promising.
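The sketch below gives only the flavour of the two-step idea, not the paper's method: block-level HTML elements serve as a crude stand-in for visual segmentation, and a simple date regex stands in for recognising entities of a semantic category; the BLOCK_TAGS list, DATE pattern and sample page are all assumptions.

```python
# Rough sketch of the two-step idea: (1) split a page into blocks using
# block-level HTML elements as a crude stand-in for visual segmentation,
# (2) split blocks further at mentions of a target entity type (here a
# simple date regex stands in for proper semantic entity recognition).
import re
from bs4 import BeautifulSoup

BLOCK_TAGS = ["p", "div", "td", "li", "h1", "h2", "h3"]
DATE = re.compile(r"\b\d{1,2} (January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4}\b")

def visual_blocks(html):
    """Step 1: one block per block-level element with non-empty text."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(" ", strip=True)
            for el in soup.find_all(BLOCK_TAGS) if el.get_text(strip=True)]

def semantic_segments(block):
    """Step 2: start a new segment whenever a date entity is mentioned."""
    pieces, start = [], 0
    for match in DATE.finditer(block):
        if match.start() > start:
            pieces.append(block[start:match.start()].strip())
        start = match.start()
    pieces.append(block[start:].strip())
    return [p for p in pieces if p]

html = ("<div>Results. The race was held on 12 August 2007 in Osaka. "
        "The final took place on 14 August 2007.</div>")
for block in visual_blocks(html):
    print(semantic_segments(block))
```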

Identification of Duplicate News Stories in Web Pages
John Gibson | Ben Wellner | Susan Lubar

Identifying near-duplicate documents is a challenge often faced in the field of information discovery. Unfortunately, many algorithms that find near-duplicate pairs of plain text documents perform poorly when used on web pages, where metadata and other extraneous information make the process much more difficult. If the content of the page (e.g., the body of a news article) can be extracted from the page, the accuracy of duplicate detection algorithms is greatly increased. Using machine learning techniques to identify the content portion of web pages, we achieve duplicate detection accuracy that is nearly identical to that on plain text and significantly better than simple heuristic approaches to content extraction. We performed these experiments on a small but fully annotated corpus.
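A standard way to find near-duplicate pairs in plain text is word n-gram "shingling" with Jaccard similarity, sketched below. The paper's point is that such comparisons work far better on extracted article text than on raw HTML with boilerplate; the specific algorithm, threshold and example documents here are illustrative assumptions, not taken from the paper.

```python
# Illustrative near-duplicate detection on extracted article text:
# word n-gram "shingles" compared with Jaccard similarity.

def shingles(text, n=4):
    """Set of word n-grams for one document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.8):
    """Return index pairs of documents whose shingle sets overlap heavily."""
    sets = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(docs)) for j in range(i + 1, len(docs))
            if jaccard(sets[i], sets[j]) >= threshold]

story = "The committee announced the final results of the election today in Brussels."
print(near_duplicates([story, story + " (updated)", "An unrelated article."]))
```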

GlossaNet 2: a linguistic search engine for RSS-based corpora
Cédrick Fairon | Kévin Macé | Hubert Naets

This paper presents GlossaNet 2, a free online concordance service that enables users to search dynamic Web corpora. Using GlossaNet involves two steps. First, users define a corpus by selecting RSS feeds from a preselected pool of sources (they can also add their own RSS feeds). These sources are visited on a regular basis by a crawler in order to generate a dynamic corpus. Second, the user can register one or more search queries on his or her dynamic corpus. The search queries are re-applied to the corpus every time it is updated, and new concordances are recorded for the user (results can be emailed, published for the user in a private RSS feed, or viewed online). The service integrates two pre-existing pieces of software: Corporator (Fairon, 2006), a program that creates corpora by downloading and filtering RSS feeds, and Unitex (Paumier, 2003), an open-source corpus processor that relies on linguistic resources. After a short introduction, we briefly present the concept of “RSS corpora” and the assets of this approach to corpus development. We then give an overview of the GlossaNet architecture and present various use cases.
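The toy script below only mimics the RSS-corpus plus registered-query idea; GlossaNet itself builds on Corporator and Unitex, not on this code. It fetches feed entries with the third-party feedparser package and re-applies a stored query as a simple keyword-in-context concordance; the feed URL and query are hypothetical.

```python
# Toy stand-in for the RSS-corpus + registered-query workflow: fetch the
# current entries of selected feeds, then run a stored query as a simple
# keyword-in-context (KWIC) concordance over the collected text.
import re
import feedparser

def fetch_corpus(feed_urls):
    """Collect the text of all current entries of the selected RSS feeds."""
    texts = []
    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            texts.append(entry.get("title", "") + ". " + entry.get("summary", ""))
    return texts

def kwic(texts, query, width=30):
    """Return keyword-in-context lines for every match of the stored query."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    lines = []
    for text in texts:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

corpus = fetch_corpus(["https://example.org/feed.rss"])   # hypothetical feed URL
for line in kwic(corpus, "climate"):
    print(line)
```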

Collecting Basque specialized corpora from the web: language-specific performance tweaks and improving topic precision
I. Leturia | I. San Vicente | X. Saralegi | M. Lopez de Lacalle

The de facto standard process for collecting corpora from the Internet (asking the APIs of search engines for random combinations of words from a given list and downloading the returned pages) does not give very good precision when searching for texts on a certain topic, and this precision is much worse when searching for corpora in the Basque language, due to certain properties inherent in the language and in the Basque web. The method proposed in this paper improves topic precision by using a sample mini-corpus as a basis for the process: the words to be used in the queries are automatically extracted from it, and a final topic-filtering step is performed using document-similarity measures with this sample corpus. We also describe the changes made to the usual process to adapt it to the peculiarities of Basque, alongside other adjustments made to improve the general performance of the system and the quality of the collected corpora.
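As an illustration of the final topic-filtering step only, the sketch below keeps candidate pages that are sufficiently similar to the sample mini-corpus, using TF-IDF vectors and cosine similarity as generic document-similarity measures; the actual measures, threshold and Basque-specific processing in the paper may differ, and the example documents are invented.

```python
# Sketch of topic filtering against a sample mini-corpus: keep candidates
# whose maximum cosine similarity to any sample document reaches a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_filter(sample_docs, candidate_docs, threshold=0.2):
    """Keep candidates that are similar enough to the sample mini-corpus."""
    vectorizer = TfidfVectorizer()
    sample_vecs = vectorizer.fit_transform(sample_docs)      # fit on the mini-corpus
    candidate_vecs = vectorizer.transform(candidate_docs)
    sims = cosine_similarity(candidate_vecs, sample_vecs)    # candidates x samples
    return [doc for doc, row in zip(candidate_docs, sims) if row.max() >= threshold]

sample = ["mountain climbing routes and equipment",
          "alpine climbing safety and weather"]
candidates = ["new climbing routes in the mountains this spring",
              "stock market results for the first quarter"]
print(topic_filter(sample, candidates))   # only the climbing page survives
```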

Introducing and evaluating ukWaC, a very large web-derived corpus of English
Adriano Ferraresi | Eros Zanchetta | Marco Baroni | Silvia Bernardini

In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.
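A generic way to carry out a vocabulary-based comparison between two corpora is the log-likelihood (G2) keyword statistic sketched below; the paper's own evaluation procedure is not reproduced here, and the word counts in the example are invented toy figures rather than ukWaC or BNC frequencies.

```python
# Generic vocabulary-based corpus comparison via the log-likelihood keyword
# statistic: words whose relative frequencies differ most between two corpora
# score highest.
import math

def log_likelihood(freq1, freq2, size1, size2):
    """G2 statistic for one word given its counts and the two corpus sizes."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

def keywords(counts1, counts2, size1, size2, top=5):
    """Words ranked by how strongly their frequency differs across corpora."""
    vocab = set(counts1) | set(counts2)
    scored = [(w, log_likelihood(counts1.get(w, 0), counts2.get(w, 0),
                                 size1, size2)) for w in vocab]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top]

web_counts = {"website": 900, "click": 400, "towards": 120}   # toy figures
ref_counts = {"website": 10, "click": 30, "towards": 150}
print(keywords(web_counts, ref_counts, size1=1_000_000, size2=1_000_000))
```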

RoDEO: Reasoning over Dependencies Extracted Online
Reda Siblini | Leila Kosseim

The web is the largest available corpus and could be enormously valuable to many natural language processing applications; however, it is becoming very difficult to identify relevant information on the web. We present a system for querying dependency tree collocations from the web, and we show its usefulness in identifying relevant information by evaluating its accuracy on the task of extracting classes of named entities. The task achieved an overall accuracy of 70%.
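The sketch below shows one way to collect and query dependency collocations from text with spaCy; it is not the authors' RoDEO system, it parses a couple of invented sentences rather than web data, and it assumes the small English spaCy model is installed.

```python
# Sketch of dependency-collocation collection and querying with spaCy (not
# the RoDEO system): each token contributes a (head lemma, relation,
# dependent lemma) triple, and the counts can then be queried, e.g. for
# nouns that typically appear as the object of a given verb.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def dependency_collocations(texts):
    """Count (head lemma, relation, dependent lemma) triples over all texts."""
    triples = Counter()
    for doc in nlp.pipe(texts):
        for token in doc:
            if token.dep_ != "ROOT":
                triples[(token.head.lemma_, token.dep_, token.lemma_)] += 1
    return triples

def query(triples, head=None, relation=None):
    """Return dependents matching a head and/or relation, most frequent first."""
    hits = Counter()
    for (h, rel, dep), count in triples.items():
        if (head is None or h == head) and (relation is None or rel == relation):
            hits[dep] += count
    return hits.most_common()

texts = ["The company founded a subsidiary in Berlin.",
         "She founded a charity and a museum."]
triples = dependency_collocations(texts)
print(query(triples, head="found", relation="dobj"))   # things that get founded
```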