Adam Kilgarriff
2015
SemEval-2015 Task 15: A CPA dictionary-entry-building task
Vít Baisa | Jane Bradbury | Silvie Cinková | Ismaïl El Maarouf | Adam Kilgarriff | Octavian Popescu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
Terminology finding in the Sketch Engine: an evaluation
Adam Kilgarriff
Proceedings of Translating and the Computer 36
Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
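The precision and recall scoring described in this abstract can be illustrated with a minimal sketch. It assumes that collocations are represented as (headword, collocate) pairs and that a gold standard set is available for comparison; the function name `precision_recall` and the example pairs are hypothetical and are not taken from the paper or its datasets.

```python
def precision_recall(candidates, gold):
    """Score extracted collocations against a gold standard set.

    Both arguments are iterables of (headword, collocate) pairs.
    This pair representation is an assumption made for illustration.
    """
    candidates, gold = set(candidates), set(gold)
    hits = len(candidates & gold)  # collocations found in both sets
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not drawn from the published gold standard.
extracted = [("tea", "strong"), ("tea", "powerful"), ("tea", "green")]
gold_standard = [("tea", "strong"), ("tea", "green"),
                 ("tea", "black"), ("tea", "herbal")]
p, r = precision_recall(extracted, gold_standard)
```

With these inputs, two of the three extracted pairs appear in the gold set, and two of the four gold pairs are recovered.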
Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Sublanguages are varieties of language that form subsets of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing
2013
Terminology finding in the Sketch engine
Adam Kilgarriff
Proceedings of Translating and the Computer 35
2012
Word Sketches for Turkish
Bharat Ram Ambati | Siva Reddy | Adam Kilgarriff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. In this paper we present word sketches for Turkish. Until now, word sketches have been generated using purpose-built finite-state grammars. Here, we use an existing dependency parser. We describe the process of collecting a 42 million word corpus, parsing it, and generating word sketches from it. We evaluate the word sketches in comparison with word sketches from a language-independent sketch grammar on an external evaluation task called topic coherence, using Turkish WordNet to derive an evaluation set of coherent topics.
2011
Helping Our Own: The HOO 2011 Pilot Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 13th European Workshop on Natural Language Generation
2010
A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a corpus factory where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we
- randomly select three or four of the seed words
- send as a query to Google or Yahoo or Bing, which returns a 'search hits' page
- gather the pages that Google or Yahoo point to and save the text.
This forms the corpus, which we then
- 'clean' (to remove navigation bars, advertisements etc)
- remove duplicates
- tokenise and (if tools are available) lemmatise and part-of-speech tag
- load into our corpus query tool, the Sketch Engine.
The corpora we have developed are available for use in the Sketch Engine corpus query tool.
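Two of the steps in this abstract, query generation from seed words and duplicate removal, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names `make_queries` and `remove_duplicates`, their parameters, and the choice of exact matching on whitespace-normalised text are all assumptions. The search-engine and page-download steps are omitted.

```python
import random


def make_queries(seed_words, n_queries, min_len=3, max_len=4, rng=None):
    """Build queries by sampling three or four distinct seed words each time.

    `rng` is an optional random.Random instance (hypothetical parameter,
    included so that runs can be made reproducible).
    """
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        k = rng.randint(min_len, max_len)  # inclusive bounds: 3 or 4 words
        queries.append(" ".join(rng.sample(seed_words, k)))
    return queries


def remove_duplicates(texts):
    """Drop exact duplicates, comparing whitespace-normalised text.

    Assumption: exact matching only. Near-duplicate detection is not
    attempted in this sketch.
    """
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            unique.append(text)  # keep the first occurrence as-is
    return unique


# Hypothetical seed words; a real run would use many more.
seeds = ["huis", "water", "jaar", "stad", "werk", "land"]
queries = make_queries(seeds, n_queries=5, rng=random.Random(0))
```

Each generated query would then be sent to a search engine, and the text of the returned pages passed through cleaning and `remove_duplicates` before tokenisation.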
A Detailed, Accurate, Extensive, Available English Lexical Database
Adam Kilgarriff
Proceedings of the NAACL HLT 2010 Demonstration Session
Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Adam Kilgarriff | Dekang Lin
Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 6th International Natural Language Generation Conference
Fast Syntactic Searching in Very Large Corpora for Many Languages
Miloš Jakubíček | Adam Kilgarriff | Diana McCarthy | Pavel Rychlý
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation
2008
Proceedings of the 4th Web as Corpus Workshop
Stefan Evert | Adam Kilgarriff | Serge Sharoff
Proceedings of the 4th Web as Corpus Workshop
Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, thereby demonstrating the influence of grammar detail on precision and recall.
Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, the results, and the lessons learnt.
2007
Last Words: Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics, Volume 33, Number 1, March 2007
An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
Pavel Rychlý | Adam Kilgarriff
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions
2006
WebBootCaT. Instant Domain-Specific Corpora to Support Human Translators
Marco Baroni | Adam Kilgarriff | Jan Pomikalek | Pavel Rychly
Proceedings of the 11th Annual Conference of the European Association for Machine Translation
Large Linguistically-Processed Web Corpora for Multiple Languages
Marco Baroni | Adam Kilgarriff
Demonstrations
Shared-Task Evaluations in HLT: Lessons for NLG
Anja Belz | Adam Kilgarriff
Proceedings of the Fourth International Natural Language Generation Conference
Annotated Web as corpus
Paul Rayson | James Walkerdine | William H. Fletcher | Adam Kilgarriff
Proceedings of the 2nd International Workshop on Web as Corpus
2005
Chinese Sketch Engine and the Extraction of Grammatical Collocations
Chu-Ren Huang | Adam Kilgarriff | Yiching Wu | Chih-Ming Chiu | Simon Smith | Pavel Rychly | Ming-Hong Bai | Keh-Jiann Chen
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing
2004
The Senseval-3 English lexical sample task
Rada Mihalcea | Timothy Chklovski | Adam Kilgarriff
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text
2003
WASPBENCH: a lexicographer’s workbench incorporating state-of-the-art word sense disambiguation
Adam Kilgarriff | Roger Evans | Rob Koeling | Michael Rundell | David Tugwell
Demonstrations
Introduction to the Special Issue on the Web as Corpus
Adam Kilgarriff | Gregory Grefenstette
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus
An Evaluation of a Lexicographers’ Workbench: building lexicons for Machine Translation
Rob Koeling | Adam Kilgarriff | David Tugwell | Roger Evans
Proceedings of the 7th International EAMT workshop on MT and other language technology tools, Improving MT through other language technology tools, Resource and tools for building MT at EACL 2003
No-bureaucracy evaluation
Adam Kilgarriff
Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?
2001
WASP-Bench: an MT lexicographers’ workstation supporting state-of-the-art lexical disambiguation
Adam Kilgarriff | David Tugwell
Proceedings of Machine Translation Summit VIII
Most MT lexicography is devoted to developing rules of the kind, “in context C, translate source-language word S as target-language word T”. Very many such rules are required, producing them is laborious, and MT companies standardly spend large sums on it. We present the WASP-Bench, a lexicographer's workstation for the rapid and semi-automatic development of such rule-sets. The WASP-Bench makes use of a large source-language corpus and state-of-the-art techniques for Word Sense Disambiguation. We show that the WSD accuracy is on a par with the best results published to date, with the advantage that the WASP-Bench, unlike other high-performance systems, does not require a sense-disambiguated training corpus as input. The WASP-Bench is designed to fit readily with MT companies' working practices, as it may be used for as many or as few source language words as present disambiguation problems for a given target.
English Lexical Sample Task Description
Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
WASP-Bench: a Lexicographic Tool Supporting Word Sense Disambiguation
David Tugwell | Adam Kilgarriff
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
2000
English Senseval: Report and Results
Adam Kilgarriff | Joseph Rosenzweig
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
What’s in a Thesaurus?
Adam Kilgarriff | Colin Yallop
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
The Concede Model for Lexical Databases
Tomaž Erjavec | Roger Evans | Nancy Ide | Adam Kilgarriff
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
1999
95% Replicability for Manual Word Sense Tagging
Adam Kilgarriff
Ninth Conference of the European Chapter of the Association for Computational Linguistics
1998
Measures for Corpus Similarity and Homogeneity
Adam Kilgarriff | Tony Rose
Proceedings of the Third Conference on Empirical Methods for Natural Language Processing
1997
Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora
Adam Kilgarriff
Fifth Workshop on Very Large Corpora
1993
Co-authors
- Pavel Rychlý 6
- David Tugwell 4
- Marco Baroni 3
- Roger Evans 3
- Miloš Jakubíček 3
- Jan Pomikálek 3
- Siva Reddy 3
- Vít Baisa 2
- Robert Dale 2
- Rob Koeling 2
- Vojtěch Kovář 2
- Serge Sharoff 2
- Bharat Ram Ambati 1
- Galia Angelova 1
- Ming-Hong Bai 1
- William A. Baumgartner, Jr. 1
- Anja Belz 1
- Jane Bradbury 1
- Francis Chantree 1
- Keh-Jiann Chen 1
- Chih-Ming Chiu 1
- Timothy Chklovski 1
- Silvie Cinková 1
- K. Bretonnel Cohen 1
- Ismail El Maarouf 1
- Anil Krishna Eragani 1
- Tomaž Erjavec 1
- Stefan Evert 1
- William H. Fletcher 1
- Gregory Grefenstette 1
- Negacy Hailu 1
- Ulrich Heid 1
- Varun Kuchib Hotla 1
- Chu-Ren Huang 1
- Nancy Ide 1
- Kremena Ivanova 1
- Lucia Kocincová 1
- Dekang Lin 1
- Diana McCarthy 1
- Tony McEnery 1
- Rada Mihalcea 1
- Ivelina Nikolova 1
- Avinesh PVS 1
- Martha Palmer 1
- Octavian Popescu 1
- Paul Rayson 1
- Tony Rose 1
- Joseph Rosenzweig 1
- Michael Rundell 1
- Sabine Schulte im Walde 1
- Dipti Misra Sharma 1
- Simon Smith 1
- Vit Suchomel 1
- Irina Temnikova 1
- James Walkerdine 1
- Yiching Wu 1
- Colin Yallop 1