Mark Hepple


2018

Transferred Embeddings for Igbo Similarity, Analogy, and Diacritic Restoration Tasks
Ignatius Ezeani | Ikechukwu Onyenwe | Mark Hepple
Proceedings of the Third Workshop on Semantic Deep Learning

Existing NLP models are mostly trained with data from well-resourced languages. Most minority languages face a lack of resources - data and technologies - for NLP research. Building these resources from scratch for each minority language would be very expensive and time-consuming, and would largely amount to unnecessarily re-inventing the wheel. In this paper, we applied transfer learning techniques to create Igbo word embeddings from a variety of existing English-trained embeddings. Transfer learning methods were also used to build standard datasets for Igbo word similarity and analogy tasks for the intrinsic evaluation of embeddings. These projected embeddings were also applied to the diacritic restoration task. Our results indicate that the projected models not only outperform the trained ones on the semantic tasks of analogy, word similarity, and odd-word identification, but also achieve enhanced performance on diacritic restoration with learned diacritic embeddings.
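A minimal sketch of one way such a cross-lingual projection can be set up (an illustrative assumption, not necessarily the authors' exact method): learn an orthogonal map from the English embedding space to the Igbo space from a small seed lexicon of aligned word pairs.

import numpy as np

def learn_projection(eng_vecs, igbo_vecs):
    # eng_vecs, igbo_vecs: (n_pairs, dim) arrays of vectors for seed translation pairs.
    # Orthogonal Procrustes: find W such that eng_vecs @ W approximates igbo_vecs.
    u, _, vt = np.linalg.svd(eng_vecs.T @ igbo_vecs)
    return u @ vt

def project(eng_vec, W):
    # Map a single English word vector into the (assumed) Igbo embedding space.
    return eng_vec @ W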

Igbo Diacritic Restoration using Embedding Models
Ignatius Ezeani | Mark Hepple | Ikechukwu Onyenwe | Enemouh Chioma
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Igbo is a low-resource language spoken by approximately 30 million people worldwide. It is the native language of the Igbo people of south-eastern Nigeria. In the Igbo language, diacritics - orthographic and tonal - play a huge role in distinguishing the meaning and pronunciation of words. Omitting diacritics in texts often leads to lexical ambiguity. Diacritic restoration is a pre-processing task that replaces missing diacritics on words from which they have been removed. In this work, we applied embedding models to the diacritic restoration task and compared their performance to that of n-gram models. Although word embedding models have been successfully applied to various NLP tasks, they have not, to our knowledge, been used for diacritic restoration. Two classes of word embedding models were used: those projected from the English embedding space, and those trained on an Igbo bible corpus (≈ 1m). Our best result, 82.49%, is an improvement on the baseline n-gram models.
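One simple way embeddings can drive restoration (a sketch under assumptions; the candidate map and scoring rule below are illustrative, not necessarily the paper's model): choose the diacritised candidate whose vector is closest to the averaged context vectors.

import numpy as np

def restore(token, context, candidates, emb):
    # candidates: hypothetical map from a bare form to its diacritised variants,
    # e.g. {"akwa": ["àkwà", "ákwá"]}; emb: word -> vector lookup.
    if token not in candidates:
        return token
    ctx = [emb[w] for w in context if w in emb]
    if not ctx:
        return candidates[token][0]
    ctx_vec = np.mean(ctx, axis=0)
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    scored = [(cos(emb[c], ctx_vec), c) for c in candidates[token] if c in emb]
    return max(scored)[1] if scored else token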

2017

Lexical Disambiguation of Igbo using Diacritic Restoration
Ignatius Ezeani | Mark Hepple | Ikechukwu Onyenwe
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

Properly written texts in Igbo, a low-resource African language, are rich in both orthographic and tonal diacritics. Diacritics are essential in capturing distinctions in the pronunciation and meaning of words, as well as in lexical disambiguation. Unfortunately, most electronic texts in diacritic languages are written without diacritics. This makes diacritic restoration a necessary step in corpus building and language processing tasks for languages with diacritics. In our previous work, we built n-gram models with simple smoothing techniques based on a closed-world assumption. However, as a classification task, diacritic restoration is well suited to machine learning and will be more generalisable with it. This paper therefore presents a more standard approach to the task, involving the application of machine learning algorithms.
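As a sketch of the classification framing (the feature set and classifier here are illustrative assumptions, not the paper's exact setup): train a classifier per ambiguous wordform over its surrounding context words, with the correct diacritised form as the label.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_restorer(contexts, labels):
    # contexts: strings of words around an occurrence of the ambiguous form;
    # labels: the correct diacritised form observed in properly written text.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(contexts, labels)
    return model

# Usage with hypothetical training data:
# restorer = train_restorer(train_contexts, train_labels)
# predicted_forms = restorer.predict(test_contexts)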

2016

What’s the Issue Here?: Task-based Evaluation of Reader Comment Summarization Systems
Emma Barker | Monica Paramita | Adam Funk | Emina Kurtic | Ahmet Aker | Jonathan Foster | Mark Hepple | Robert Gaizauskas
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic summarization of reader comments in on-line news is an extremely challenging task and a capability for which there is a clear need. Work to date has focussed on producing extractive summaries using well-known techniques imported from other areas of language processing. But are extractive summaries of comments what users really want? Do they support users in performing the sorts of tasks they are likely to want to perform with reader comments? In this paper we address these questions by doing three things. First, we offer a specification of one possible summary type for reader comment, based on an analysis of reader comment in terms of issues and viewpoints. Second, we define a task-based evaluation framework for reader comment summarization that allows summarization systems to be assessed in terms of how well they support users in a time-limited task of identifying issues and characterising opinion on issues in comments. Third, we describe a pilot evaluation in which we used the task-based evaluation framework to evaluate a prototype reader comment clustering and summarization system, demonstrating the viability of the evaluation framework and illustrating the sorts of insight such an evaluation affords.

Studying the Temporal Dynamics of Word Co-occurrences: An Application to Event Detection
Daniel Preoţiuc-Pietro | P. K. Srijith | Mark Hepple | Trevor Cohn
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Streaming media provides a number of unique challenges for computational linguistics. This paper studies the temporal variation in word co-occurrence statistics, with application to event detection. We develop a spectral clustering approach to find groups of mutually informative terms occurring in discrete time frames. Experiments on large datasets of tweets show that these groups identify key real world events as they occur in time, despite no explicit supervision. The performance of our method rivals state-of-the-art methods for event detection on F-score, obtaining higher recall at the expense of precision.
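A minimal sketch of spectral clustering over a per-time-frame term co-occurrence matrix (the affinity weighting and cluster count are illustrative assumptions, not the paper's exact configuration):

from sklearn.cluster import SpectralClustering

def cluster_terms(cooc, terms, n_clusters=10):
    # cooc: square numpy array of term-by-term co-occurrence counts for one time frame;
    # terms: the vocabulary items indexing its rows/columns.
    affinity = (cooc + cooc.T) / 2.0   # symmetrise; PMI weighting is another option
    labels = SpectralClustering(
        n_clusters=n_clusters,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    clusters = {}
    for term, label in zip(terms, labels):
        clusters.setdefault(label, []).append(term)
    return clusters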

The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News
Emma Barker | Monica Lestari Paramita | Ahmet Aker | Emina Kurtic | Mark Hepple | Robert Gaizauskas
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Automatic label generation for news comment clusters
Ahmet Aker | Monica Paramita | Emina Kurtic | Adam Funk | Emma Barker | Mark Hepple | Rob Gaizauskas
Proceedings of the 9th International Natural Language Generation conference

2015

Comment-to-Article Linking in the Online News Domain
Ahmet Aker | Emina Kurtic | Mark Hepple | Rob Gaizauskas | Giuseppe Di Fabbrizio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language
Ikechukwu Onyenwe | Mark Hepple | Chinedu Uchechukwu | Ignatius Ezeani
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

2014

Part-of-speech Tagset and Corpus Development for Igbo, an African Language
Ikechukwu Onyenwe | Chinedu Uchechukwu | Mark Hepple
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2010

Evaluating Lexical Substitution: Analysis and New Measures
Sanaz Jabbari | Mark Hepple | Louise Guthrie
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Lexical substitution is the task of finding a replacement for a target word in a sentence so as to preserve, as closely as possible, the meaning of the original sentence. It has been proposed that lexical substitution be used as a basis for assessing the performance of word sense disambiguation systems, an idea realised in the English Lexical Substitution Task of SemEval-2007. In this paper, we examine the evaluation metrics used for the English Lexical Substitution Task and identify some problems that arise for them. We go on to propose some alternative measures for this purpose, that avoid these problems, and which in turn can be seen as redefining the key tasks that lexical substitution systems should be expected to perform. We hope that these new metrics will better serve to guide the development of lexical substitution systems in future work. One of the new metrics addresses how effective systems are in ranking substitution candidates, a key ability for lexical substitution systems, and we report some results concerning the assessment of systems produced by this measure as compared to the relevant measure from SemEval-2007.

Efficient Minimal Perfect Hash Language Models
David Guthrie | Mark Hepple | Wei Liu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The availability of large collections of text has made it possible to build language models that incorporate counts of billions of n-grams. This paper proposes two new methods of efficiently storing large language models that allow O(1) random access and use significantly less space than all known approaches. We introduce two novel data structures that take advantage of the distribution of n-grams in corpora and make use of varying numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams using 2.5 bytes per n-gram, and language models of quantized probabilities using 2.26 bytes per n-gram. These methods allow language processing applications to take advantage of much larger language models than was previously possible using the same hardware, and we additionally describe how they can be used in a distributed environment to store even larger models. We show that our approaches are simple to implement and can easily be combined with pruning and quantization to achieve additional reductions in the size of the language model.
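A toy sketch of the storage idea (assuming a precomputed minimal perfect hash function is supplied, e.g. built offline over the known n-gram set; the one-byte fingerprint is an illustrative choice, not the paper's exact layout):

import hashlib

class CompactNGramStore:
    def __init__(self, ngrams, values, mph):
        # mph: assumed minimal perfect hash mapping each stored n-gram to a unique
        # slot in [0, len(ngrams)); not implemented here.
        self.mph = mph
        n = len(ngrams)
        self.fingerprints = bytearray(n)   # small per-slot fingerprints
        self.values = [0] * n              # counts or quantized probabilities
        for ng, v in zip(ngrams, values):
            slot = mph(ng)
            self.fingerprints[slot] = self._fingerprint(ng)
            self.values[slot] = v

    @staticmethod
    def _fingerprint(ngram):
        return hashlib.blake2b(ngram.encode("utf-8"), digest_size=1).digest()[0]

    def lookup(self, ngram):
        slot = self.mph(ngram)             # O(1); arbitrary slot for unseen n-grams
        if slot >= len(self.values) or self.fingerprints[slot] != self._fingerprint(ngram):
            return None                    # treated as unseen, up to fingerprint collisions
        return self.values[slot]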

Evaluation Metrics for the Lexical Substitution Task
Sanaz Jabbari | Mark Hepple | Louise Guthrie
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
David Guthrie | Mark Hepple
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2008

Combining Terminology Resources and Statistical Methods for Entity Recognition: an Evaluation
Angus Roberts | Robert Gaizauskas | Mark Hepple | Yikun Guo
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Terminologies and other knowledge resources are widely used to aid entity recognition in specialist domain texts. As well as providing lexicons of specialist terms, linkage from the text back to a resource can make additional knowledge available to applications. Use of such resources is especially pertinent in the biomedical domain, where large numbers of these resources are available, and where they are widely used in informatics applications. Terminology resources can be most readily used by simple lexical lookup of terms in the text. A major drawback with such lexical lookup, however, is poor precision caused by ambiguity between domain terms and general language words. We combine lexical lookup with simple filtering of ambiguous terms, to improve precision. We compare this lexical lookup with a statistical method of entity recognition, and with a method that combines the two approaches. We show that the combined method boosts precision with little loss of recall, and that linkage from recognised entities back to the domain knowledge resources can be maintained.
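A bare-bones illustration of lexical lookup with a simple ambiguity filter (the greedy longest-match strategy and single-word stop-list are assumptions for the sketch, not the evaluated system):

def lookup_entities(tokens, term_dict, common_words, max_len=5):
    # term_dict: set of lower-cased domain terms (possibly multi-word);
    # common_words: general-language words filtered out when they match on their own.
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        for j in range(min(len(tokens), i + max_len), i, -1):
            span = " ".join(tokens[i:j]).lower()
            if span in term_dict and not (j - i == 1 and span in common_words):
                match = (i, j, span)
                break
        if match:
            entities.append(match)
            i = match[1]
        else:
            i += 1
    return entities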

Cross-Domain Dialogue Act Tagging
Nick Webb | Ting Liu | Mark Hepple | Yorick Wilks
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present recent work in the area of Cross-Domain Dialogue Act (DA) tagging. We have previously reported on the use of a simple dialogue act classifier based on purely intra-utterance features - principally involving word n-gram cue phrases automatically generated from a training corpus. Such a classifier performs surprisingly well, rivalling scores obtained using far more sophisticated language modelling techniques. In this paper, we apply these automatically extracted cues to a new annotated corpus, to determine the portability and generality of the cues we learn.
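A small sketch of the general idea of intra-utterance word n-gram cues (thresholds and the back-off to a default act are illustrative assumptions): keep n-grams that strongly predict one dialogue act in the training data, then label a new utterance by its strongest matching cue.

from collections import Counter, defaultdict

def extract_cues(utterances, acts, n=3, min_count=5, min_predictivity=0.8):
    ngram_act = defaultdict(Counter)
    for utt, act in zip(utterances, acts):
        toks = utt.lower().split()
        for k in range(1, n + 1):
            for i in range(len(toks) - k + 1):
                ngram_act[" ".join(toks[i:i + k])][act] += 1
    cues = {}
    for ng, counts in ngram_act.items():
        total = sum(counts.values())
        act, c = counts.most_common(1)[0]
        if total >= min_count and c / total >= min_predictivity:
            cues[ng] = (act, c / total)
    return cues

def classify(utterance, cues, n=3, default="statement"):
    toks = utterance.lower().split()
    best = (0.0, default)
    for k in range(1, n + 1):
        for i in range(len(toks) - k + 1):
            ng = " ".join(toks[i:i + k])
            if ng in cues:
                act, predictivity = cues[ng]
                best = max(best, (predictivity, act))
    return best[1]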

Extracting Clinical Relationships from Patient Narratives
Angus Roberts | Robert Gaizauskas | Mark Hepple
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2007

SemEval-2007 Task 15: TempEval Temporal Relation Identification
Marc Verhagen | Robert Gaizauskas | Frank Schilder | Mark Hepple | Graham Katz | James Pustejovsky
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

USFD: Preliminary Exploration of Features and Classifiers for the TempEval-2007 Task
Mark Hepple | Andrea Setzer | Robert Gaizauskas
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2005

SUPPLE: A Practical Parser for Natural Language Engineering Applications
Robert Gaizauskas | Mark Hepple | Horacio Saggion | Mark A. Greenwood | Kevin Humphreys
Proceedings of the Ninth International Workshop on Parsing Technology

2004

Human Dialogue Modelling Using Annotated Corpora
Yorick Wilks | Nick Webb | Andrea Setzer | Mark Hepple | Roberta Catizone
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A Large-Scale Resource for Storing and Recognizing Technical Terminology
Henk Harkema | Robert Gaizauskas | Mark Hepple | Neil Davis | Yikun Guo | Angus Roberts | Ian Roberts
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

NLP-enhanced Content Filtering Within the POESIA Project
Mark Hepple | Neil Ireson | Paolo Allegrini | Simone Marchi | Simonetta Montemagni | Jose Maria Gomez Hidalgo
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A Large Scale Terminology Resource for Biomedical Text Processing
Henk Harkema | Robert Gaizauskas | Mark Hepple | Angus Roberts | Ian Roberts | Neil Davis | Yikun Guo
HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases

2000

Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based POS Taggers
Mark Hepple
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

An Earley-style Predictive Chart Parsing Method for Lambek Grammars
Mark Hepple
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Memoisation for Glue Language Deduction and Categorial Parsing
Mark Hepple
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Compacting the Penn Treebank Grammar
Alexander Krotov | Mark Hepple | Robert Gaizauskas | Yorick Wilks
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Memoisation for Glue Language Deduction and Categorial Parsing
Mark Hepple
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Compacting the Penn Treebank Grammar
Alexander Krotov | Mark Hepple | Robert Gaizauskas | Yorick Wilks
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

On some similarities between D-tree grammars and type-logical grammars
Mark Hepple
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

1997

Maximal Incrementality in Linear Categorial Deduction
Mark Hepple
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

1996

A Compilation-Chart Method for Linear Categorial Deduction
Mark Hepple
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1995

Mixing Modes of Linguistic Description in Categorial Grammar
Mark Hepple
Seventh Conference of the European Chapter of the Association for Computational Linguistics

1994

Discontinuity and the Lambek Calculus
Mark Hepple
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1992

Chart Parsing Lambek Grammars: Modal Extensions and Incrementality
Mark Hepple
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

1991

Efficient Incremental Processing With Categorial Grammar
Mark Hepple
29th Annual Meeting of the Association for Computational Linguistics

Proof Figures and Structural Operators for Categorial Grammar
Guy Barry | Mark Hepple | Neil Leslie | Glyn Morrill
Fifth Conference of the European Chapter of the Association for Computational Linguistics

1990

Normal Form Theorem Proving for the Lambek Calculus
Mark Hepple
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics

1989

Parsing and Derivational Equivalence
Mark Hepple | Glyn Morrill
Fourth Conference of the European Chapter of the Association for Computational Linguistics