Felix Bildhauer
2020
Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi | Felix Bildhauer | Roland Schäfer | Egon Stemle
Proceedings of the 12th Web as Corpus Workshop
2017
Data point selection for genre-aware parsing
Ines Rehbein | Felix Bildhauer
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
2016
Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
Roland Schäfer | Felix Bildhauer
Proceedings of the 10th Web as Corpus Workshop
2014
Focused Web Corpus Crawling
Roland Schäfer | Adrien Barbaresi | Felix Bildhauer
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
Felix Bildhauer | Roland Schäfer
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
2013
Identifying "aboutness topics": two annotation experiments
Philippa Cook | Felix Bildhauer
Dialogue & Discourse Volume 4
This paper deals with the annotation of "aboutness topic" (also known as "sentence topic") in naturally occurring data. We report on two annotation experiments in which relatively poor inter-rater agreement was attained for the annotation of topics, although the coders were adhering to the same annotation instructions in each experiment. After presenting some theoretical background on the notion of topic in linguistics, we present the first experiment. Tokens that prove particularly difficult to assess in that experiment are identified, systematized, and discussed in some detail. In sum, the cases that were most likely to lead to non-matching annotations are those that either require a decision between "thetic" and "topic-comment", or involve an overlap between focus and topic. To increase inter-rater agreement, we modified the annotation guidelines, trying to eliminate some of the confounds from the first experiment. We then trained other annotators to use the modified guidelines and set them an annotation task. Again, the degree of inter-rater agreement was slightly disappointing. We discuss what we believe to be the problem cases in this task and give some guidance for future modification of the guidelines. The findings raise a number of issues that may contribute to the discussion in theoretical linguistics, and they may also alert other researchers planning a similar enterprise to some pitfalls they may encounter.
2012
Building Large Corpora from the Web Using a New Efficient Tool Chain
Roland Schäfer | Felix Bildhauer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been actively researched. Prominently, the WaCky initiative has provided both theoretical results and a set of web corpora for selected European languages. We present a software toolkit for web corpus construction and a set of significantly larger corpora (up to over 9 billion tokens) built using this software. First, we discuss how the data should be collected to ensure that it is not biased towards certain hosts. Then, we describe our software toolkit, which performs basic cleanups, boilerplate removal, simple connected text detection, and shingling to remove duplicates from the corpora. We finally report evaluation results of the corpora built so far, for example w.r.t. the amount of duplication contained and the text type/genre distribution. Where applicable, we compare our corpora to the WaCky corpora, since it is inappropriate, in our view, to compare web corpora to traditional or balanced corpora. While we use some methods applied by the WaCky initiative, we can show that we have introduced incremental improvements.
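The abstract above mentions shingling as the duplicate-removal step. As a rough illustration of the general technique only, and not of the toolkit's actual implementation, the following sketch compares documents by the overlap of their word n-gram sets (shingles). The function names, the shingle length `n`, the similarity threshold, and the greedy keep-first strategy are all assumptions made for this example; a tool working at web scale would typically use hashed or sampled shingles rather than exact set comparison.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a text.

    Hypothetical helper: lowercases and splits on word characters.
    Texts shorter than n tokens yield a single shingle of all tokens.
    """
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, n=5, threshold=0.8):
    """Greedily keep each document unless it is a near-duplicate
    (similarity >= threshold) of a document kept earlier.

    The threshold and the quadratic comparison are illustrative
    assumptions, not values taken from the paper.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

Calling `deduplicate(docs, n=3, threshold=0.8)` on a list of page texts would drop pages that differ from an earlier page in only a few words, while keeping unrelated pages.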