Bruno Pouliquen


2016

The United Nations Parallel Corpus v1.0
Michał Ziemski | Marcin Junczys-Dowmunt | Bruno Pouliquen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus.

Keynote Lecture 1: Practical Use of Machine Translation in International Organizations
Bruno Pouliquen
Proceedings of the 13th International Conference on Natural Language Processing

2015

Full-text patent translation at WIPO; scalability, quality and usability
Bruno Pouliquen
Proceedings of the 6th Workshop on Patent and Scientific Literature Translation

SMT at the International Maritime Organization experiences with combining in-house corpus with more general corpus
Bruno Pouliquen | Marcin Junczys-Dowmunt | Blanca Pinero | Michał Ziemski
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

SMT at the International Maritime Organization: experiences with combining in-house corpora with out-of-domain corpora
Bruno Pouliquen | Marcin Junczys-Dowmunt | Blanca Pinero | Michal Ziemski
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

SMT of German patents at WIPO: decompounding and verb structure pre-reordering
Marcin Junczys-Dowmunt | Bruno Pouliquen
Proceedings of the 17th Annual conference of the European Association for Machine Translation

2013

Large-scale Multiple Language Translation Accelerator at the United Nations
Bruno Pouliquen | Cecilia Elizalde | Marcin Junczys-Dowmunt | Christophe Mazenc | Jose Garcia-Verdugo
Proceedings of Machine Translation Summit XIV: User track

2012

Statistical Machine Translation prototype using UN parallel documents
Bruno Pouliquen | Christophe Mazenc | Cecilia Elizalde | Jose Garcia-Verdugo
Proceedings of the 16th Annual conference of the European Association for Machine Translation

TAPTA4UN: collaboration on machine translation between the World Intellectual Property Organization and the United Nations
Cecilia Elizalde | Bruno Pouliquen | Christophe Mazenc | José García-Verdugo
Proceedings of Translating and the Computer 34

2011

JRC-NAMES: A Freely Available, Highly Multilingual Named Entity Resource
Ralf Steinberger | Bruno Pouliquen | Mijail Kabadjov | Jenya Belyaeva | Erik van der Goot
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

COPPA, CLIR and TAPTA: three tools to assist in overcoming the Patent language barrier at WIPO
Bruno Pouliquen
Proceedings of Machine Translation Summit XIII: Plenaries

Tapta: A user-driven translation system for patent documents based on domain-aware Statistical Machine Translation
Bruno Pouliquen | Christophe Mazenc | Aldo Iorio
Proceedings of the 15th Annual conference of the European Association for Machine Translation

Statistical Machine Translation
Bruno Pouliquen | Christophe Mazenc | Aldo Iorio
Proceedings of the 15th Annual conference of the European Association for Machine Translation

Automatic translation tools at WIPO
Bruno Pouliquen | Christophe Mazenc
Proceedings of Translating and the Computer 33

2010

Adapting a resource-light highly multilingual Named Entity Recognition system to Arabic
Wajdi Zaghouani | Bruno Pouliquen | Mohamed Ebrahim | Ralf Steinberger
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a fully functional Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic; instead, a highly multilingual, almost language-independent NER system was adapted to also cover Arabic. The Semitic language Arabic substantially differs from the Indo-European and Finno-Ugric languages currently covered. This paper thus describes what Arabic language-specific resources had to be developed and what changes needed to be made to the otherwise language-independent rule set in order to be applicable to the Arabic language. The achieved evaluation results are generally satisfactory, but could be improved for certain entity types. The results of the IE tools can be seen on the Arabic pages of the freely accessible Europe Media Monitor (EMM) application NewsExplorer, which can be found at http://press.jrc.it/overview.html.

Sentiment Analysis in the News
Alexandra Balahur | Ralf Steinberger | Mijail Kabadjov | Vanni Zavarella | Erik van der Goot | Matina Halkia | Bruno Pouliquen | Jenya Belyaeva
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recent years have brought a significant growth in the volume of research in sentiment analysis, mostly on highly subjective text types (movie or product reviews). The main difference these texts have with news articles is that their target is clearly defined and unique across the text. Following different annotation efforts and the analysis of the issues encountered, we realised that news opinion mining is different from that of other text types. We identified three subtasks that need to be addressed: definition of the target; separation of the good and bad news content from the good and bad sentiment expressed on the target; and analysis of clearly marked opinion that is expressed explicitly, not needing interpretation or the use of world knowledge. Furthermore, we distinguish three different possible views on newspaper articles (author, reader and text), which have to be addressed differently at the time of analysing sentiment. Given these definitions, we present work on mining opinions about entities in English language news, in which we apply these concepts. Results showed that this idea is more appropriate in the context of news opinion mining and that the approaches taking this into consideration produce a better performance.

2008

Online-Monitoring of Security-Related Events
Martin Atkinson | Jakub Piskorski | Bruno Pouliquen | Ralf Steinberger | Hristo Tanev | Vanni Zavarella
Coling 2008: Companion volume: Demonstrations

Story tracking: linking similar news over time and across languages
Bruno Pouliquen | Ralf Steinberger | Olivier Deguernel
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

2006

The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
Ralf Steinberger | Bruno Pouliquen | Anna Widiger | Camelia Ignat | Tomaž Erjavec | Dan Tufiş | Dániel Varga
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).

Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation
Bruno Pouliquen | Marco Kimler | Ralf Steinberger | Camelia Ignat | Tamara Oellinger | Ken Blackler | Flavio Fluart | Wajdi Zaghouani | Anna Widiger | Ann-Charlotte Forslund | Clive Best
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources other than a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic place names world-wide. The system uses a number of language-independent clues and heuristics to disambiguate place name homographs. The final aim is to index texts with the countries and cities they mention and to automatically visualise this information on geographical maps using various tools.

2004

Multilingual and cross-lingual news topic tracking
Bruno Pouliquen | Ralf Steinberger | Camelia Ignat | Emilia Käsper | Irina Temnikova
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics