Adam Kilgarriff

2015

2014

pdf bib

Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib abs

Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sublanguages are varieties of language that form subsets of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

pdf bib abs

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
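The precision and recall figures mentioned in this abstract can be illustrated with a minimal sketch. It assumes collocations are represented as (headword, collocate) pairs and compared by exact match; the function name and the sample pairs are hypothetical and are not taken from the paper's datasets.

```python
def precision_recall(extracted, gold):
    """Score extracted collocations against a gold-standard set.

    Both arguments are iterables of (headword, collocate) pairs.
    Exact matching is an assumption made for this sketch.
    """
    extracted, gold = set(extracted), set(gold)
    hits = extracted & gold
    # Guard against empty inputs to avoid division by zero.
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Invented example data, for illustration only.
gold = [("tea", "strong"), ("rain", "heavy"),
        ("decision", "make"), ("attention", "pay")]
extracted = [("tea", "strong"), ("rain", "heavy"), ("rain", "big")]

p, r = precision_recall(extracted, gold)
```

Here two of the three extracted pairs are in the gold set, and two of the four gold pairs were found.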

pdf bib

Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib

Terminology finding in the Sketch Engine: an evaluation
Adam Kilgarriff
Proceedings of Translating and the Computer 36

2013

pdf bib

Terminology finding in the Sketch engine
Adam Kilgarriff
Proceedings of Translating and the Computer 35

2012

pdf bib abs

Word Sketches for Turkish
Bharat Ram Ambati | Siva Reddy | Adam Kilgarriff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. In this paper we present word sketches for Turkish. Until now, word sketches have been generated using purpose-built finite-state grammars. Here, we use an existing dependency parser. We describe the process of collecting a 42 million word corpus, parsing it, and generating word sketches from it. We evaluate the word sketches in comparison with word sketches from a language-independent sketch grammar on an external evaluation task called topic coherence, using Turkish WordNet to derive an evaluation set of coherent topics.

2011

pdf bib

Helping Our Own: The HOO 2011 Pilot Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib

Fast Syntactic Searching in Very Large Corpora for Many Languages
Miloš Jakubíček | Adam Kilgarriff | Diana McCarthy | Pavel Rychlý
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf bib

Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 6th International Natural Language Generation Conference

pdf bib

Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Adam Kilgarriff | Dekang Lin
Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

pdf bib

A Detailed, Accurate, Extensive, Available English Lexical Database
Adam Kilgarriff
Proceedings of the NAACL HLT 2010 Demonstration Session

pdf bib abs

A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a corpus factory where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we
* randomly select three or four of the seed words
* send as a query to Google or Yahoo or Bing, which returns a 'search hits' page
* gather the pages that Google or Yahoo point to and save the text.
This forms the corpus, which we then
* 'clean' (to remove navigation bars, advertisements etc)
* remove duplicates
* tokenise and (if tools are available) lemmatise and part-of-speech tag
* load into our corpus query tool, the Sketch Engine
The corpora we have developed are available for use in the Sketch Engine corpus query tool.
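The first step of the procedure described in this abstract, randomly combining seed words into search queries, might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the fixed tuple size, and the seed words are hypothetical, and submitting the queries to a search engine, downloading the pages, and cleaning them are left out.

```python
import math
import random


def make_seed_queries(seed_words, n_queries, tuple_size=3, rng=None):
    """Return n_queries distinct queries, each a random tuple of seed words.

    Hypothetical helper; the tuple size of 3 follows the abstract's
    "three or four" but is otherwise an assumption.
    """
    rng = rng or random.Random()
    vocab = sorted(set(seed_words))
    # Refuse requests that could never be satisfied (avoids an endless loop).
    if n_queries > math.comb(len(vocab), tuple_size):
        raise ValueError("not enough seed words for that many distinct queries")
    queries = set()
    while len(queries) < n_queries:
        # Sorting makes the same word combination map to the same query.
        combo = sorted(rng.sample(vocab, tuple_size))
        queries.add(" ".join(combo))
    return sorted(queries)


# Invented seed words for illustration; not taken from the paper.
seeds = ["hus", "vatten", "skola", "bok", "väg", "stad"]
queries = make_seed_queries(seeds, n_queries=5, tuple_size=3,
                            rng=random.Random(0))
```

Each resulting string would then be sent to a search engine, and the pages it returns would be collected as raw corpus text.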

2008

pdf bib abs

Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, results, and lessons learnt.

pdf bib abs

Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, thereby demonstrating the influence of grammar detail on precision and recall.

pdf bib

Proceedings of the 4th Web as Corpus Workshop
Stefan Evert | Adam Kilgarriff | Serge Sharoff
Proceedings of the 4th Web as Corpus Workshop

2007

pdf bib

An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
Pavel Rychlý | Adam Kilgarriff
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf bib

Last Words: Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics, Volume 33, Number 1, March 2007

2006

pdf bib

Annotated Web as corpus
Paul Rayson | James Walkerdine | William H. Fletcher | Adam Kilgarriff
Proceedings of the 2nd International Workshop on Web as Corpus

pdf bib

Shared-Task Evaluations in HLT: Lessons for NLG
Anja Belz | Adam Kilgarriff
Proceedings of the Fourth International Natural Language Generation Conference

pdf bib

Large Linguistically-Processed Web Corpora for Multiple Languages
Marco Baroni | Adam Kilgarriff
Demonstrations

pdf bib

WebBootCaT. Instant Domain-Specific Corpora to Support Human Translators
Marco Baroni | Adam Kilgarriff | Jan Pomikalek | Pavel Rychly
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

WASPBENCH: a lexicographer’s workbench incorporating state-of-the-art word sense disambiguation
Adam Kilgarriff | Roger Evans | Rob Koeling | Michael Rundell | David Tugwell
Demonstrations

2001

pdf bib

WASP-Bench: a Lexicographic Tool Supporting Word Sense Disambiguation
David Tugwell | Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib

English Lexical Sample Task Description
Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib abs

WASP-Bench: an MT lexicographers’ workstation supporting state-of-the-art lexical disambiguation
Adam Kilgarriff | David Tugwell
Proceedings of Machine Translation Summit VIII

Most MT lexicography is devoted to developing rules of the kind, “in context C, translate source-language word S as target-language word T”. Very many such rules are required, producing them is laborious, and MT companies standardly spend large sums on it. We present the WASP-Bench, a lexicographer's workstation for the rapid and semi-automatic development of such rule-sets. The WASP-Bench makes use of a large source-language corpus and state-of-the-art techniques for Word Sense Disambiguation. We show that the WSD accuracy is on a par with the best results published to date, with the advantage that the WASP-Bench, unlike other high-performance systems, does not require a sense-disambiguated training corpus as input. The WASP-Bench is designed to fit readily with MT companies' working practices, as it may be used for as many or as few source language words as present disambiguation problems for a given target.