Jeroen Geertzen

2014

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to the recently released EFCamDat corpus, which is not only multiple times larger than previous L2 corpora but also provides longitudinal data at several proficiency levels. Our investigation using accurate machine learning with a wide range of linguistic features reveals interesting patterns in the longitudinal data which are useful for both further development of NLI and its application to research on L2 acquisition.

2010

LIPS: A Tool for Predicting the Lexical Isolation Point of a Word
Andrew Thwaites | Jeroen Geertzen | William D. Marslen-Wilson | Paula Buttery
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present LIPS (Lexical Isolation Point Software), a tool for accurate lexical isolation point (IP) prediction in recordings of speech. The IP is the point in time at which a word is correctly recognised given the acoustic evidence available to the hearer. The ability to accurately determine lexical IPs is of importance to work in the field of cognitive processing, since it enables the evaluation of competing models of word recognition. IPs are also of importance in the field of neurolinguistics, where the analyses of high-temporal-resolution neuroimaging data require a precise time alignment of the observed brain activity with the linguistic input. LIPS provides an attractive alternative to costly multi-participant perception experiments by automatically computing IPs for arbitrary words. On a test set of words, the LIPS system predicts IPs with a mean difference from the actual IP of within 1 ms. The differences between the predicted and actual IPs approximate a normal distribution with a standard deviation of around 80 ms (depending on the model used).

The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals
Caroline Williams | Andrew Thwaites | Paula Buttery | Jeroen Geertzen | Billi Randall | Meredith Shafto | Barry Devereux | Lorraine Tyler
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Investigating differences in linguistic usage between individuals who have suffered brain injury (hereafter patients) and those who have not can yield a number of benefits. It provides a better understanding of the precise way in which impairments affect patients' language, improves theories of how the brain processes language, and offers heuristics for diagnosing certain types of brain damage based on patients' speech. One method for investigating usage differences involves the analysis of spontaneous speech. In the work described here we construct a text corpus consisting of transcripts of individuals' speech produced during two tasks: the Boston-cookie-theft picture description task (Goodglass and Kaplan, 1983) and a spontaneous speech task, which elicits a semi-prompted monologue and/or free speech. Interviews with patients aged 19 to 89 years were transcribed, as were interviews with a comparable number of healthy individuals (aged 20 to 89 years). Structural brain images are available for approximately 30% of participants. This unique data source provides a rich resource for future research in many areas of language impairment and has been constructed to facilitate analysis with natural language processing and corpus linguistics techniques.

2009

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction
Jeroen Geertzen
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

Semantic interpretation of Dutch spoken dialogue (short paper)
Jeroen Geertzen
Proceedings of the Eighth International Conference on Computational Semantics

Wide-coverage parsing of speech transcripts
Jeroen Geertzen
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

Evaluating Dialogue Act Tagging with Naive and Expert Annotators
Jeroen Geertzen | Volha Petukhova | Harry Bunt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper the dialogue act annotations of naive and expert annotators, both annotating the same data, are compared in order to characterise the insights that annotations made by different kinds of annotators may provide for evaluating dialogue act tagsets. It is argued that the agreement among naive annotators provides insight into the clarity of the tagset, whereas agreement among expert annotators provides an indication of how reliably the tagset can be applied when errors are ruled out that are due to deficiencies in understanding the concepts of the tagset, to a lack of experience in using the annotation tool, or to little experience in annotation more generally. An indication of the differences between the two groups in terms of inter-annotator agreement and tagging accuracy on task-oriented dialogue in different domains, annotated with the DIT++ dialogue act tagset, is presented, and the annotations of both groups are assessed against a gold standard. Additionally, the effect of reducing the tagset's granularity on the performance of both groups is examined. In general, it is concluded that the annotations of both groups provide complementary insights into reliability, clarity, and more fundamental conceptual issues.