Adam Kilgarriff


2015

SemEval-2015 Task 15: A CPA dictionary-entry-building task
Vít Baisa | Jane Bradbury | Silvie Cinková | Ismaïl El Maarouf | Adam Kilgarriff | Octavian Popescu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
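
As an illustration of the evaluation described in this abstract, the following is a minimal sketch of scoring one headword's extracted collocations against a gold standard. The function name `precision_recall` and the sample collocations are hypothetical and are not taken from the paper or its datasets.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of candidate collocations against a gold set."""
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data for a single headword ("tea"); not from the paper.
gold = {("strong", "tea"), ("green", "tea"), ("cup", "tea"), ("drink", "tea")}
extracted = [("strong", "tea"), ("green", "tea"), ("hot", "tea")]

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these two figures over all sampled headwords would give one score per corpus and parameter setting, which is one plausible way to compare corpora under this framework.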

Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sublanguages are varieties of language that form “subsets” of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing

Terminology finding in the Sketch Engine: an evaluation
Adam Kilgarriff
Proceedings of Translating and the Computer 36

Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Terminology finding in the Sketch engine
Adam Kilgarriff
Proceedings of Translating and the Computer 35

2012

Word Sketches for Turkish
Bharat Ram Ambati | Siva Reddy | Adam Kilgarriff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. In this paper we present word sketches for Turkish. Until now, word sketches have been generated using purpose-built finite-state grammars. Here, we use an existing dependency parser. We describe the process of collecting a 42-million-word corpus, parsing it, and generating word sketches from it. We evaluate the word sketches in comparison with word sketches from a language-independent sketch grammar on an external evaluation task called topic coherence, using Turkish WordNet to derive an evaluation set of coherent topics.

2011

Helping Our Own: The HOO 2011 Pilot Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 13th European Workshop on Natural Language Generation

2010

A Detailed, Accurate, Extensive, Available English Lexical Database
Adam Kilgarriff
Proceedings of the NAACL HLT 2010 Demonstration Session

Fast Syntactic Searching in Very Large Corpora for Many Languages
Miloš Jakubíček | Adam Kilgarriff | Diana McCarthy | Pavel Rychlý
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Adam Kilgarriff | Dekang Lin
Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 6th International Natural Language Generation Conference

A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we
* randomly select three or four of the seed words
* send as a query to Google or Yahoo or Bing, which returns a 'search hits' page
* gather the pages that Google or Yahoo point to and save the text.
This forms the corpus, which we then
* 'clean' (to remove navigation bars, advertisements etc.)
* remove duplicates
* tokenise and (if tools are available) lemmatise and part-of-speech tag
* load into our corpus query tool, the Sketch Engine
The corpora we have developed are available for use in the Sketch Engine corpus query tool.
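
Two of the steps listed in this abstract, building random seed-word queries and removing duplicates, can be sketched as follows. This is a minimal illustration under stated assumptions, not the BootCaT implementation: the seed list and the function names `make_queries` and `remove_duplicates` are hypothetical, no search engine is called, and duplicate detection here is exact matching after whitespace and case normalisation.

```python
import random


def make_queries(seed_words, n_queries, rng, sizes=(3, 4)):
    """Build queries, each made of three or four randomly chosen seed words."""
    queries = []
    for _ in range(n_queries):
        k = rng.choice(sizes)  # pick a query length of 3 or 4
        queries.append(" ".join(rng.sample(seed_words, k)))
    return queries


def remove_duplicates(texts):
    """Drop texts that are identical after whitespace and case normalisation."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical seed words; the paper takes its seeds from Wikipedia.
SEEDS = ["water", "house", "people", "school", "music", "river"]

queries = make_queries(SEEDS, 5, random.Random(0))
pages = remove_duplicates(["Some page text", "some  page TEXT", "Another page"])
print(queries)
print(pages)
```

Passing an explicit `random.Random` instance keeps the query selection reproducible, which helps when a corpus build needs to be repeated or audited.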

2008

Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of the words’ grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, thereby demonstrating the influence of grammar detail on precision and recall.
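
The idea of a regular expression-based shallow grammar over part-of-speech tags can be illustrated with a small sketch. It does not use Sketch Engine's own sketch grammar syntax: the `word/TAG` encoding, the tag names `ADJ` and `NOUN`, and the function `extract_modifiers` are assumptions made for this example only.

```python
import re

# One hypothetical rule: an adjective immediately followed by a noun is
# taken as evidence that the adjective modifies the noun.
MODIFIER = re.compile(r"([^\s/]+)/ADJ ([^\s/]+)/NOUN")


def extract_modifiers(tagged_sentence):
    """Return (noun, adjective) pairs found in a 'word/TAG' encoded sentence."""
    return [(noun, adj) for adj, noun in MODIFIER.findall(tagged_sentence)]


# Hypothetical tagged input, not taken from the paper's corpus.
tagged = "der/DET starke/ADJ Kaffee/NOUN und/CONJ der/DET grüne/ADJ Tee/NOUN"
pairs = extract_modifiers(tagged)
print(pairs)
```

A rule this coarse ignores case, which is the kind of grammar detail whose effect on precision and recall the paper evaluates.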

Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, the results, and the lessons learnt.

2007

Last Words: Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics, Volume 33, Number 1, March 2007

An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
Pavel Rychlý | Adam Kilgarriff
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Large Linguistically-Processed Web Corpora for Multiple Languages
Marco Baroni | Adam Kilgarriff
Demonstrations

Shared-Task Evaluations in HLT: Lessons for NLG
Anja Belz | Adam Kilgarriff
Proceedings of the Fourth International Natural Language Generation Conference

Annotated Web as corpus
Paul Rayson | James Walkerdine | William H. Fletcher | Adam Kilgarriff
Proceedings of the 2nd International Workshop on Web as Corpus

WebBootCaT. Instant Domain-Specific Corpora to Support Human Translators
Marco Baroni | Adam Kilgarriff | Jan Pomikálek | Pavel Rychlý
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

Chinese Sketch Engine and the Extraction of Grammatical Collocations
Chu-Ren Huang | Adam Kilgarriff | Yiching Wu | Chih-Ming Chiu | Simon Smith | Pavel Rychlý | Ming-Hong Bai | Keh-Jiann Chen
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

2004

The Senseval-3 English lexical sample task
Rada Mihalcea | Timothy Chklovski | Adam Kilgarriff
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

2003

WASPBENCH: a lexicographer’s workbench incorporating state-of-the-art word sense disambiguation
Adam Kilgarriff | Roger Evans | Rob Koeling | Michael Rundell | David Tugwell
Demonstrations

An Evaluation of a Lexicographers’ Workbench: building lexicons for Machine Translation
Rob Koeling | Adam Kilgarriff | David Tugwell | Roger Evans
Proceedings of the 7th International EAMT workshop on MT and other language technology tools, Improving MT through other language technology tools, Resource and tools for building MT at EACL 2003

No-bureaucracy evaluation
Adam Kilgarriff
Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?

Introduction to the Special Issue on the Web as Corpus
Adam Kilgarriff | Gregory Grefenstette
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2001

WASP-Bench: an MT lexicographers’ workstation supporting state-of-the-art lexical disambiguation
Adam Kilgarriff | David Tugwell
Proceedings of Machine Translation Summit VIII

Most MT lexicography is devoted to developing rules of the kind, “in context C, translate source-language word S as target-language word T”. Very many such rules are required, producing them is laborious, and MT companies standardly spend large sums on it. We present the WASP-Bench, a lexicographer's workstation for the rapid and semi-automatic development of such rule-sets. The WASP-Bench makes use of a large source-language corpus and state-of-the-art techniques for Word Sense Disambiguation. We show that the WSD accuracy is on a par with the best results published to date, with the advantage that the WASP-Bench, unlike other high-performance systems, does not require a sense-disambiguated training corpus as input. The WASP-Bench is designed to fit readily with MT companies' working practices, as it may be used for as many or as few source language words as present disambiguation problems for a given target.

English Lexical Sample Task Description
Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

WASP-Bench: a Lexicographic Tool Supporting Word Sense Disambiguation
David Tugwell | Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

English Senseval: Report and Results
Adam Kilgarriff | Joseph Rosenzweig
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

What’s in a Thesaurus?
Adam Kilgarriff | Colin Yallop
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

The Concede Model for Lexical Databases
Tomaž Erjavec | Roger Evans | Nancy Ide | Adam Kilgarriff
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

95% Replicability for Manual Word Sense Tagging
Adam Kilgarriff
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

Proceedings of the Pilot SENSEVAL
Adam Kilgarriff | Martha Palmer
Proceedings of the Pilot SENSEVAL

Measures for Corpus Similarity and Homogeneity
Adam Kilgarriff | Tony Rose
Proceedings of the Third Conference on Empirical Methods for Natural Language Processing

1997

Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora
Adam Kilgarriff
Fifth Workshop on Very Large Corpora

1993

Inheriting Verb Alternations
Adam Kilgarriff
Sixth Conference of the European Chapter of the Association for Computational Linguistics