Gertjan van Noord

Also published as: Gertjan Van Noord


2024

Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation
Lukas Edman | Gabriele Sarti | Antonio Toral | Gertjan van Noord | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 12

Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.

Endowing Neural Language Learners with Human-like Biases: A Case Study on Dependency Length Minimization
Yuqing Zhang | Tessa Verhoef | Gertjan van Noord | Arianna Bisazza
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural languages show a tendency to minimize the linear distance between heads and their dependents in a sentence, known as dependency length minimization (DLM). Such a preference, however, has not been consistently replicated with neural agent simulations. Comparing the behavior of models with that of human learners can reveal which aspects affect the emergence of this phenomenon. In this work, we investigate the minimal conditions that may lead neural learners to develop a DLM preference. We add three factors to the standard neural-agent language learning and communication framework to make the simulation more realistic, namely: (i) the presence of noise during listening, (ii) context-sensitivity of word use through non-uniform conditional word distributions, and (iii) incremental sentence processing, or the extent to which an utterance’s meaning can be guessed before hearing it entirely. While no preference appears in production, we show that the proposed factors can contribute to a small but significant learning advantage of DLM for listeners of verb-initial languages.
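As a rough illustration of the quantity that DLM refers to, the sketch below sums the linear distances between heads and their dependents for one sentence. The function name, the head-index encoding (1-based positions, 0 for the root), and the two example analyses are assumptions made for this illustration; they are not taken from the paper.

```python
from typing import List


def total_dependency_length(heads: List[int]) -> int:
    """Sum of |position - head position| over all non-root tokens.

    heads[i] is the 1-based position of the head of token i + 1;
    a value of 0 marks the root, which contributes no length.
    """
    total = 0
    for position, head in enumerate(heads, start=1):
        if head == 0:
            continue  # the root has no head, so no distance to add
        if not 1 <= head <= len(heads):
            raise ValueError(f"head index {head} out of range")
        total += abs(position - head)
    return total


# Hypothetical analyses of two orderings of the same content.
short_order = [2, 0, 2, 5, 2]  # "John threw out the trash"
long_order = [2, 0, 4, 2, 2]   # "John threw the trash out"

print(total_dependency_length(short_order))
print(total_dependency_length(long_order))
```

Under these assumed analyses the first ordering has the smaller total, which is the kind of preference a DLM bias would favor.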

2022

Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord | Sebastian Ruder
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Massively multilingual models are promising for transfer learning across tasks and languages. However, existing methods are unable to fully leverage training data when it is available in different task-language combinations. To exploit such heterogeneous supervision, we propose Hyper-X, a single hypernetwork that unifies multi-task and multilingual learning with efficient adaptation. It generates weights for adapter modules conditioned on both tasks and language embeddings. By learning to combine task- and language-specific knowledge, our model enables zero-shot transfer for unseen languages and task-language combinations. Our experiments on a diverse set of languages demonstrate that Hyper-X achieves the best or competitive gain when a mixture of multiple resources is available, while performing on par with strong baselines in the standard scenario. Hyper-X is also considerably more efficient in terms of parameters and resources compared to methods that train separate adapters. Finally, Hyper-X consistently produces strong results in few-shot scenarios for new languages, showing the versatility of our approach beyond zero-shot transfer.

Subword-Delimited Downsampling for Better Character-Level Translation
Lukas Edman | Antonio Toral | Gertjan van Noord
Findings of the Association for Computational Linguistics: EMNLP 2022

Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.

UDapter: Typology-based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord
Computational Linguistics, Volume 48, Issue 3 - September 2022

Recent advances in multilingual language modeling have brought the idea of a truly universal parser closer to reality. However, such models are still not immune to the “curse of multilinguality”: Cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel language adaptation approach by introducing contextual language adapters to a multilingual parser. Contextual language adapters make it possible to learn adapters via language embeddings while sharing model parameters across languages based on contextual parameter generation. Moreover, our method allows for an easy but effective integration of existing linguistic typology features into the parsing model. Because not all typological features are available for every language, we further combine typological feature prediction with parsing in a multi-task model that achieves very competitive parsing performance without the need for an external prediction system for missing features. The resulting parser, UDapter, can be used for dependency parsing as well as sequence labeling tasks such as POS tagging, morphological tagging, and NER. In dependency parsing, it outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. In sequence labeling tasks, our parser surpasses the baseline on high-resource languages, and performs very competitively in a zero-shot setting. Our in-depth analyses show that adapter generation via typological features of languages is key to this success.

Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages
Prajit Dhar | Arianna Bisazza | Gertjan van Noord
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs, under low-resource settings. Additionally, we introduce Inflection Pre-Training (or PT-Inflect), a novel pre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target language translation. We conduct our evaluation on four typologically diverse target MRLs, and find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.

2021

The Importance of Context in Very Low Resource Language Modeling
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

This paper investigates very low resource language model pretraining, when less than 100 thousand sentences are available. We find that, in very low-resource scenarios, statistical n-gram language models outperform state-of-the-art neural models. Our experiments show that this is mainly due to the focus of the former on a local context. As such, we introduce three methods to improve a neural model’s performance in the low-resource setting, finding that limiting the model’s self-attention is the most effective one, improving on downstream tasks such as NLI and POS tagging by up to 5% for the languages we test on: English, Hindi, and Turkish.

Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages
Prajit Dhar | Arianna Bisazza | Gertjan van Noord
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger dictionary sizes lead to higher translation quality.

Unsupervised Translation of German–Lower Sorbian: Exploring Training and Novel Transfer Methods on a Low-Resource Language
Lukas Edman | Ahmet Üstün | Antonio Toral | Gertjan van Noord
Proceedings of the Sixth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2021 Unsupervised Machine Translation task for German–Lower Sorbian (DE–DSB): a high-resource language to a low-resource one. Our system uses a transformer encoder-decoder architecture in which we make three changes to the standard training procedure. First, our training focuses on two languages at a time, contrasting with a wealth of research on multilingual systems. Second, we introduce a novel method for initializing the vocabulary of an unseen language, achieving improvements of 3.2 BLEU for DE->DSB and 4.0 BLEU for DSB->DE. Lastly, we experiment with the order in which offline and online back-translation are used to train an unsupervised system, finding that using online back-translation first works better for DE->DSB by 2.76 BLEU. Our submissions ranked first (tied with another team) for DSB->DE and third for DE->DSB.

2020

Linguistically Motivated Subwords for English-Tamil Translation: University of Groningen’s Submission to WMT-2020
Prajit Dhar | Arianna Bisazza | Gertjan van Noord
Proceedings of the Fifth Conference on Machine Translation

This paper describes our submission for the English-Tamil news translation task of WMT-2020. The various techniques and Neural Machine Translation (NMT) models used by our team are presented and discussed, including back-translation, fine-tuning and word dropout. Additionally, our experiments show that using a linguistically motivated subword segmentation technique (Ataman et al., 2017) does not consistently outperform the more widely used, non-linguistically motivated SentencePiece algorithm (Kudo and Richardson, 2018), despite the agglutinative nature of Tamil morphology.

Data Selection for Unsupervised Translation of German–Upper Sorbian
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the Fifth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.

A Shared Task of a New, Collaborative Type to Foster Reproducibility: A First Exercise in the Area of Language Science and Technology with REPROLANG2020
António Branco | Nicoletta Calzolari | Piek Vossen | Gertjan Van Noord | Dieter van Uytvanck | João Silva | Luís Gomes | André Moreira | Willem Elbers
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce a new type of shared task, which is collaborative rather than competitive, designed to support and foster the reproduction of research results. We also describe the first event running such a novel challenge, present the results obtained, discuss the lessons learned and ponder on future undertakings.

Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.

AlpinoGraph: A Graph-based Search Engine for Flexible and Efficient Treebank Search
Peter Kleiweg | Gertjan van Noord
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

UDapter: Language Adaptation for Truly Universal Dependency Parsing
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent advances in multilingual dependency parsing have brought the idea of a truly universal parser closer to reality. However, cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel multilingual task adaptation approach based on contextual parameter generation and adapter modules. This approach makes it possible to learn adapters via language embeddings while sharing model parameters across languages. It also allows for an easy but effective integration of existing linguistic typology features into the parsing network. The resulting parser, UDapter, outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. Our in-depth analyses show that soft parameter sharing via typological features is key to this success.

2019

Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.
Ahmet Üstün | Rob van der Goot | Gosse Bouma | Gertjan van Noord
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our submission to SIGMORPHON 2019 Task 2: Morphological analysis and lemmatization in context. Our model is a multi-task sequence to sequence neural network, which jointly learns morphological tagging and lemmatization. On the encoding side, we exploit character-level as well as contextual information. We introduce a multi-attention decoder to selectively focus on different parts of character and word sequences. To further improve the model, we train on multiple datasets simultaneously and use external embeddings for initialization. Our final model reaches an average morphological tagging F1 score of 94.54 and a lemma accuracy of 93.91 on the test data, ranking respectively 3rd and 6th out of 13 teams in the SIGMORPHON 2019 shared task.

Cross-Lingual Word Embeddings for Morphologically Rich Languages
Ahmet Üstün | Gosse Bouma | Gertjan van Noord
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Cross-lingual word embedding models learn a shared vector space for two or more languages so that words with similar meaning are represented by similar vectors regardless of their language. Although the existing models achieve high performance on pairs of morphologically simple languages, they perform very poorly on morphologically rich languages such as Turkish and Finnish. In this paper, we propose a morpheme-based model in order to increase the performance of cross-lingual word embeddings on morphologically rich languages. Our model includes a simple extension which enables us to exploit morphemes for cross-lingual mapping. We applied our model for the Turkish-Finnish language pair on the bilingual word translation task. Results show that our model outperforms the baseline models by 2% in the nearest neighbour ranking.

2018

Simple Embedding-Based Word Sense Disambiguation
Dieke Oele | Gertjan van Noord
Proceedings of the 9th Global Wordnet Conference

We present a simple knowledge-based WSD method that uses word and sense embeddings to compute the similarity between the gloss of a sense and the context of the word. Our method is inspired by the Lesk algorithm as it exploits both the context of the words and the definitions of the senses. It only requires large unlabeled corpora and a sense inventory such as WordNet, and therefore does not rely on annotated data. We explore whether additional extensions to Lesk are compatible with our method. The results of our experiments show that lexically extending the number of words in the gloss and context, although it works well for other implementations of Lesk, harms our method. Using a lexical selection method on the context words, on the other hand, improves it. The combination of our method with lexical selection enables our method to outperform state-of-the-art knowledge-based systems.

Squib: Reproducibility in Computational Linguistics: Are We Willing to Share?
Martijn Wieling | Josine Rawee | Gertjan van Noord
Computational Linguistics, Volume 44, Issue 4 - December 2018

This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen’s influential “Last Words” contribution in Computational Linguistics, we investigate to what extent researchers in computational linguistics are willing and able to share their data and code. We surveyed all 395 full papers presented at the 2011 and 2016 ACL Annual Meetings, and identified whether links to data and code were provided. If working links were not provided, authors were requested to provide this information. Although data were often available, code was shared less often. When working links to code or data were not provided in the paper, authors provided the code in about one third of cases. For a selection of ten papers, we attempted to reproduce the results using the provided data and code. We were able to reproduce the results approximately for six papers. For only a single paper did we obtain the exact same results. Our findings show that even though the situation appears to have improved comparing 2016 to 2011, empiricism in computational linguistics still largely remains a matter of faith. Nevertheless, we are somewhat optimistic about the future. Ensuring reproducibility is not only important for the field as a whole, but also seems worthwhile for individual researchers: The median citation count for studies with working links to the source code is higher.

A Taxonomy for In-depth Evaluation of Normalization for User Generated Content
Rob van der Goot | Rik van Noord | Gertjan van Noord
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Modeling Input Uncertainty in Neural Network Dependency Parsing
Rob van der Goot | Gertjan van Noord
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recently introduced neural network parsers allow for new approaches to circumvent data sparsity issues by modeling character-level information and by exploiting raw data in a semi-supervised setting. Data sparsity is especially prevalent when transferring to non-standard domains. In this setting, lexical normalization has often been used in the past to circumvent data sparsity. In this paper, we investigate whether these new neural approaches provide functionality similar to lexical normalization, or whether they are complementary. We provide experimental results which show that a separate normalization component improves performance of a neural network parser even if it has access to character-level information as well as external word embeddings. Further improvements are obtained by a straightforward but novel approach in which the top-N best candidates provided by the normalization component are available to the parser.

2017

Increasing Return on Annotation Investment: The Automatic Construction of a Universal Dependency Treebank for Dutch
Gosse Bouma | Gertjan van Noord
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

The Power of Character N-grams in Native Language Identification
Artur Kulmizev | Bo Blankers | Johannes Bjerva | Malvina Nissim | Gertjan van Noord | Barbara Plank | Martijn Wieling
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore the performance of a linear SVM trained on language independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance (87.56 F1-score) on the evaluation set using only 1-9 character n-grams as features. We compare this against several ensemble and meta-classifiers in order to examine how the linear system fares when combined with other, especially non-linear classifiers. Special emphasis is placed on the topic bias that exists by virtue of the assessment essay prompt distribution.
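To make the feature set in the abstract above concrete, here is a minimal sketch of counting character n-grams of lengths 1 to 9 in a text. It covers feature extraction only; the function name and defaults are assumptions for illustration, and any preprocessing, weighting, or SVM training used in the actual system is not shown.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 9) -> Counter:
    """Count every character n-gram of length n_min..n_max in text."""
    if n_min < 1 or n_max < n_min:
        raise ValueError("require 1 <= n_min <= n_max")
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text; the range is empty
        # when the text is shorter than n.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Small usage example with a reduced range so the output stays readable.
features = char_ngrams("abc", n_min=1, n_max=2)
print(sorted(features.items()))
```

Counts like these would typically be turned into a sparse feature vector before being passed to a linear classifier.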

Distributional Lesk: Effective Knowledge-Based Word Sense Disambiguation
Dieke Oele | Gertjan van Noord
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

Parser Adaptation for Social Media by Integrating Normalization
Rob van der Goot | Gertjan van Noord
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work explores different approaches to using normalization for parser adaptation. Traditionally, normalization is used as a separate pre-processing step. We show that integrating the normalization model into the parsing algorithm is more beneficial. This way, multiple normalization candidates can be leveraged, which improves parsing performance on social media. We test this hypothesis by modifying the Berkeley parser; out-of-the-box it achieves an F1 score of 66.52. Our integrated approach reaches a significant improvement with an F1 score of 67.36, while using the best normalization sequence results in an F1 score of only 66.94.

2016

Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders
Simon Šuster | Ivan Titov | Gertjan van Noord
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
Rosa Gaudio | Gorka Labaka | Eneko Agirre | Petya Osenova | Kiril Simov | Martin Popel | Dieke Oele | Gertjan van Noord | Luís Gomes | João António Rodrigues | Steven Neale | João Silva | Andreia Querido | Nuno Rendeiro | António Branco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Proceedings of the 2nd Deep Machine Translation Workshop
Jan Hajič | Gertjan van Noord | António Branco
Proceedings of the 2nd Deep Machine Translation Workshop

Obituary: In Memoriam: Susan Armstrong
Pierrette Bouillon | Paola Merlo | Gertjan van Noord | Mike Rosner
Computational Linguistics, Volume 42, Issue 2 - June 2016

2015

ROB: Using Semantic Meaning to Recognize Paraphrases
Rob van der Goot | Gertjan van Noord
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Comparison of Coreference Resolvers for Deep Syntax Translation
Michal Novák | Dieke Oele | Gertjan van Noord
Proceedings of the Second Workshop on Discourse in Machine Translation

Lexical choice in Abstract Dependency Trees
Dieke Oele | Gertjan van Noord
Proceedings of the 1st Deep Machine Translation Workshop

2014

Treelet Probabilities for HPSG Parsing and Error Correction
Angelina Ivanova | Gertjan van Noord
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Most state-of-the-art parsers aim to produce an analysis for any input despite errors. However, small grammatical mistakes in a sentence often cause a parser to fail to build a correct syntactic tree. Applications that can identify and correct mistakes during parsing are particularly interesting for processing user-generated noisy content. Such systems could potentially take advantage of the linguistic depth of broad-coverage precision grammars. In order to choose the best correction for an utterance, the probabilities of parse trees of different sentences should be comparable, which is not supported by the discriminative methods underlying parsing software for processing deep grammars. In the present work we assess the treelet model for determining generative probabilities for HPSG parsing with error correction. In the first experiment the treelet model is applied to the parse selection task and shows higher exact match accuracy than the baseline and PCFG. In the second experiment it is tested for the ability to score the parse tree of the correct sentence higher than the constituency tree of the original version of the sentence containing a grammatical error.

From neighborhood to parenthood: the advantages of dependency representation over bigrams in Brown clustering
Simon Šuster | Gertjan van Noord
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2011

Effective Measures of Domain Similarity for Parsing
Barbara Plank | Gertjan van Noord
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Reversible Stochastic Attribute-Value Grammars
Daniël de Kok | Barbara Plank | Gertjan van Noord
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

An Empirical Comparison of Unknown Word Prediction Methods
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of 5th International Joint Conference on Natural Language Processing

Adaptability of Lexical Acquisition for Large-scale Grammars
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Using Unknown Word Techniques to Learn Known Words
Kostadin Cholakov | Gertjan van Noord
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Grammar-Driven versus Data-Driven: Which Parsing System Is More Affected by Domain Shifts?
Barbara Plank | Gertjan van Noord
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

Acquisition of Unknown Word Paradigms for Large-Scale Grammars
Kostadin Cholakov | Gertjan van Noord
Coling 2010: Posters

POS Multi-tagging Based on Combined Models
Yan Zhao | Gertjan van Noord
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the POS tagging task, there are two kinds of statistical models: generative models, such as the HMM, and discriminative models, such as the Maximum Entropy Model (MEM). POS multi-tagging decoding methods include the N-best paths method and the forward-backward method. In this paper, we use the forward-backward decoding method based on a combined model of HMM and MEM. If P(t) is the forward-backward probability of each possible tag t, we first calculate P(t) according to HMM and MEM separately. For all tag options in a certain position in a sentence, we normalize P(t) in HMM and MEM separately. The probability of the combined model is the sum of the normalized forward-backward probabilities P_norm(t) in HMM and MEM. For each word w, we select the tag for which the probability of the combined model is the highest. In the experiments, we use the combined model and obtain higher accuracy than any single model on POS tagging tasks for three languages: Chinese, English and Dutch. The result indicates that our combined model is effective.
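The combination step described in the abstract above can be sketched as follows. This is a minimal illustration that assumes the per-position forward-backward probabilities from the HMM and the MEM have already been computed; the function names and the toy numbers are hypothetical, and tie-breaking by tag name is an added assumption.

```python
from typing import Dict, List

TagScores = Dict[str, float]


def normalize(scores: TagScores) -> TagScores:
    """Rescale the tag scores for one position so that they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {tag: value / total for tag, value in scores.items()}


def combine_position(hmm: TagScores, mem: TagScores) -> TagScores:
    """Sum the separately normalized HMM and MEM scores for each tag."""
    hmm_norm = normalize(hmm)
    mem_norm = normalize(mem)
    tags = set(hmm_norm) | set(mem_norm)
    return {t: hmm_norm.get(t, 0.0) + mem_norm.get(t, 0.0) for t in tags}


def best_tags(hmm_seq: List[TagScores], mem_seq: List[TagScores]) -> List[str]:
    """Pick, for each position, the tag with the highest combined score."""
    result = []
    for hmm, mem in zip(hmm_seq, mem_seq):
        combined = combine_position(hmm, mem)
        # Iterating over sorted tag names makes ties deterministic (assumption).
        result.append(max(sorted(combined), key=combined.get))
    return result


# Toy forward-backward scores for a two-word sentence (made-up numbers).
hmm_scores = [{"N": 0.6, "V": 0.4}, {"N": 0.3, "V": 0.1}]
mem_scores = [{"N": 0.3, "V": 0.7}, {"N": 0.1, "V": 0.1}]
print(best_tags(hmm_scores, mem_scores))
```

Normalizing each model separately before summing keeps one model from dominating merely because its raw scores are on a larger scale.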

2009

Learning Efficient Parsing
Gertjan van Noord
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Combining Finite State and Corpus-based Techniques for Unknown Word Prediction
Kostadin Cholakov | Gertjan van Noord
Proceedings of the International Conference RANLP-2009

Parsed Corpora for Linguistics
Gertjan van Noord | Gosse Bouma
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?

A generalized method for iterative error mining in parsing results
Daniël de Kok | Jianqiang Ma | Gertjan van Noord
Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009)

2008

Exploring an Auxiliary Distribution Based Approach to Domain Adaptation of a Syntactic Disambiguation Model
Barbara Plank | Gertjan van Noord
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation

From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk | Martin Reynaert | Paola Monachesi | Gertjan Van Noord | Roeland Ordelman | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

2007

ACL 2007 Workshop on Deep Linguistic Processing
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
ACL 2007 Workshop on Deep Linguistic Processing

Using Self-Trained Bilexical Preferences to Improve Disambiguation Accuracy
Gertjan van Noord
Proceedings of the Tenth International Conference on Parsing Technologies

The Impact of Deep Linguistic Processing on Parsing Technology
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
Proceedings of the Tenth International Conference on Parsing Technologies

2006

Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments are provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

At Last Parsing Is Now Operational
Gertjan van Noord
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées

Natural language analysis systems which combine knowledge-based and corpus-based methods are now becoming accurate enough to be used in various applications. We describe one such parsing system for Dutch, known as Alpino, and we show how corpus-based methods are essential to obtain accurate knowledge-based parsers. In particular we show a variety of cases where large amounts of parser output are used to improve the parser.

Robust Parsing, Error Mining, Automated Lexical Acquisition, and Evaluation
Gertjan van Noord
Proceedings of the Workshop on ROMAND 2006: Robust Methods in Analysis of Natural language Data

2004

Error Mining for Wide-Coverage Grammar Engineering
Gertjan van Noord
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2001

Unsupervised POS-Tagging Improves Parsing Accuracy and Parsing Efficiency
Robbert Prins | Gertjan van Noord
Proceedings of the Seventh International Workshop on Parsing Technologies

2000

Approximation and Exactness in Finite State Optimality Theory
Dale Gerdemann | Gertjan van Noord
Proceedings of the Fifth Workshop of the ACL Special Interest Group in Computational Phonology

Treatment of epsilon moves in subset construction
Gertjan van Noord
Computational Linguistics, Volume 26, Number 1, March 2000

1999

Transducers from Rewrite Rules with Backreferences
Dale Gerdemann | Gertjan van Noord
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

Treatment of e-Moves in Subset Construction
Gertjan van Noord
Finite State Methods in Natural Language Processing

1997

Grammatical analysis in the OVIS spoken-dialogue system
Mark-Jan Nederhof | Gosse Bouma | Rob Koeling | Gertjan van Noord
Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications

Hdrug. A Flexible and Extendible Development Environment for Natural Language Processing.
Gertjan van Noord | Gosse Bouma
Computational Environments for Grammar Development and Linguistic Engineering

An Efficient Implementation of the Head-Corner Parser
Gertjan van Noord
Computational Linguistics, Volume 23, Number 3, September 1997

1995

The intersection of Finite State Automata and Definite Clause Grammars
Gertjan van Noord
33rd Annual Meeting of the Association for Computational Linguistics

1994

Adjuncts and the Processing of Lexical Rules
Gertjan van Noord | Gosse Bouma
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

Constraint-Based Categorial Grammar
Gosse Bouma | Gertjan van Noord
32nd Annual Meeting of the Association for Computational Linguistics

1993

Head-driven Parsing for Lexicalist Grammars: Experimental Results
Gosse Bouma | Gertjan van Noord
Sixth Conference of the European Chapter of the Association for Computational Linguistics

1992

Self-Monitoring with Reversible Grammars
Gunter Neumann | Gertjan van Noord
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics

1991

Towards Uniform Processing of Constraint-based Categorial Grammars
Gertjan van Noord
Reversible Grammar in Natural Language Processing

Head Corner Parsing for Discontinuous Constituency
Gertjan van Noord
29th Annual Meeting of the Association for Computational Linguistics

1990

Reversible Unification Based Machine Translation
Gertjan van Noord
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics

Semantic-Head-Driven Generation
Stuart M. Shieber | Gertjan van Noord | Fernando C. N. Pereira | Robert C. Moore
Computational Linguistics, Volume 16, Number 1, March 1990

1989

A Semantic-Head-Driven Generation Algorithm for Unification-Based Formalisms
Stuart M. Shieber | Gertjan van Noord | Robert C. Moore | Fernando C. N. Pereira
27th Annual Meeting of the Association for Computational Linguistics

An Approach to Sentence-Level Anaphora in Machine Translation
Gertjan van Noord | Joke Dorrepaal | Doug Arnold | Steven Krauwer | Louisa Sadler | Louis des Tombe
Fourth Conference of the European Chapter of the Association for Computational Linguistics