Gideon Maillette de Buy Wenniger


2020

pdf bib
SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.
Thomas van Dongen | Gideon Maillette de Buy Wenniger | Lambert Schomaker
Proceedings of the First Workshop on Scholarly Document Processing

Predicting the number of citations of scholarly documents is an upcoming task in scholarly document processing. Besides the intrinsic merit of this information, it also has a wider use as an imperfect proxy for quality which has the advantage of being cheaply available for large volumes of scholarly documents. Previous work has dealt with number of citations prediction with relatively small training data sets, or larger datasets but with short, incomplete input text. In this work we leverage the open access ACL Anthology collection in combination with the Semantic Scholar bibliometric database to create a large corpus of scholarly documents with associated citation information and we propose a new citation prediction model called SChuBERT. In our experiments we compare SChuBERT with several state-of-the-art citation prediction models and show that it outperforms previous methods by a large margin. We also show the merit of using more training data and longer input for number of citations prediction.

pdf bib
Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction
Gideon Maillette de Buy Wenniger | Thomas van Dongen | Eleri Aedmaa | Herbert Teun Kruitbosch | Edwin A. Valentijn | Lambert Schomaker
Proceedings of the First Workshop on Scholarly Document Processing

Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To tackle these problems, we propose the use of HANs combined with structure-tags which mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model and joint textual+visual model as well as against plain HANs. Compared to plain HANs, accuracy increases on all three domains.On the computation and language domain our new model works best overall, and increases accuracy 4.7% over the best literature result. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. For our HAN-system with structure-tags we reach 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model as well as 1.0% improvement over plain HANs.

2019

pdf bib
Combining PBSMT and NMT Back-translated Data for Efficient NMT
Alberto Poncelas | Maja Popović | Dimitar Shterionov | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation, which consists on generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performances when the training set is augmented with back-translated data created by merging different MT approaches.

pdf bib
Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of The 8th Workshop on Patent and Scientific Literature Translation

2018

pdf bib
Data Selection with Feature Decay Algorithms Using an Approximated Target Side
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the 15th International Conference on Spoken Language Translation

Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One of the possible data selection techniques are transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. In order to try to fix this problem, in this paper we propose to use an approximated target-side in addition to the source-side when selecting suitable sentence-pairs for training a model. This approximated target-side is built by pre-translating the source-side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.

2014

pdf bib
Bilingual Markov Reordering Labels for Hierarchical SMT
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

2013

pdf bib
Hierarchical Alignment Decomposition Labels for Hiero Grammar Rules
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis.
Gideon Maillette de Buy Wenniger | Khalil Sima’an
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation