Pierre Nugues

2025

Matching and Linking Entries in Historical Swedish Encyclopedias
Simon Börjesson | Erik Ersmark | Pierre Nugues
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

The Nordisk familjebok is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable for their size, ranging from 20 to 38 volumes. As a consequence, the Nordisk familjebok had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from Project Runeberg. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written in 1876–1899 and 1904–1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.

2024

Linking Named Entities in Diderot’s Encyclopédie to Wikidata
Pierre Nugues
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Diderot’s Encyclopédie is a reference work from the XVIIIth century in Europe that aimed at collecting the knowledge of its era. Wikipedia has the same ambition with a much greater scope. However, the lack of digital connection between the two encyclopedias may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata, which backs the articles with a graph of structured data. In this paper, we describe the annotation of more than 9,100 of the Encyclopédie entries with Wikidata identifiers, enabling us to connect these entries to the graph. We considered geographic and human entities. The Encyclopédie does not contain biographic entries as they mostly appear as subentries of locations. We extracted all the geographic entries and we completely annotated all the entries containing a description of human entities. This represents more than 2,600 links referring to locations or human entities. In addition, we annotated more than 8,300 entries having a geographic content only. We describe the annotation process as well as application examples. This resource is available at https://github.com/pnugues/encyclopedie_1751.

Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
Axel Ahlin | Alfred Myrne Blåder | Pierre Nugues
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the Nordisk Familjebok ‘Nordic Family Book’, focusing on the second edition called Uggleupplagan. This edition comprises 38 volumes and over 182,000 articles, making it one of the most extensive Swedish encyclopedia editions. Using a classifier, we first determined the category of the entities. We found that approximately 22 percent of the encyclopedia entries were locations. We applied named entity recognition to these entries and linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations, resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. The analysis showed a concentration within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the Nordisk Familjebok, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.

2022

Arabic Image Captioning using Pre-training of Deep Bidirectional Transformers
Jonathan Emami | Pierre Nugues | Ashraf Elnagar | Imad Afyouni
Proceedings of the 15th International Conference on Natural Language Generation

Connecting a French Dictionary from the Beginning of the 20th Century to Wikidata
Pierre Nugues
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Petit Larousse illustré is a French dictionary first published in 1905. Its division into two main parts, on language and on history and geography, corresponds to a major milestone in French lexicography as well as a repository of general knowledge from this period. Although the value of many entries from 1905 remains intact, some descriptions now have a dimension that is more historical than contemporary. They are nonetheless significant to analyze and understand cultural representations from this time. A comparison with more recent information or a verification of these entries would require tedious manual work. In this paper, we describe a new lexical resource, where we connected all the dictionary entries of the history and geography part to current data sources. For this, we linked each of these entries to a Wikidata identifier. Using the Wikidata links, we can more easily automate the identification, comparison, and verification of historically-situated representations. We give a few examples of how to process Wikidata identifiers and we carried out a small analysis of the entities described in the dictionary to outline possible applications. The resource, i.e. the annotation of 20,245 dictionary entries with Wikidata links, is available from GitHub (https://github.com/pnugues/petit_larousse_1905/).

2020

Hedwig: A Named Entity Linker
Marcus Klang | Pierre Nugues
Proceedings of the Twelfth Language Resources and Evaluation Conference

Named entity linking is the task of identifying mentions of named things in text, such as “Barack Obama” or “New York”, and linking these mentions to unique identifiers. In this paper, we describe Hedwig, an end-to-end named entity linker, which uses a combination of word and character BiLSTM models for mention detection, a Wikidata and Wikipedia-derived knowledge base with global information aggregated over nine language editions, and a PageRank algorithm for entity linking. We evaluated Hedwig on the TAC2017 dataset, consisting of news texts and discussion forums, and we obtained a final score of 59.9% on CEAFmC+, an improvement over our previous generation linker Ugglan, and a trilingual entity link score of 71.9%.

2019

Docria: Processing and Storing Linguistic Data with Wikipedia
Marcus Klang | Pierre Nugues
Proceedings of the 22nd Nordic Conference on Computational Linguistics

The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpus which spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models as they may contain malicious and incorrect data. Docria is a library which attempts to address this issue by providing a solution which can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code. Docria is available as open-source code.

In this paper, we investigate the annotation projection of semantic units in a practical setting. Previous approaches have focused on using parallel corpora for semantic transfer. We evaluate an alternative approach using loosely parallel corpora that does not require the corpora to be exact translations of each other. We developed a method that transfers semantic annotations from one language to another using sentences aligned by entities, and we extended it to include alignments by entity-like linguistic units. We conducted our experiments on a large scale using the English, Swedish, and French language editions of Wikipedia. Our results show that the annotation projection using entities in combination with loosely parallel corpora provides a viable approach to extending previous attempts. In addition, it allows the generation of proposition banks upon which semantic parsers can be trained.

Langforia: Language Pipelines for Annotating Large Collections of Documents
Marcus Klang | Pierre Nugues
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we describe Langforia, a multilingual processing pipeline to annotate texts with multiple layers: formatting, parts of speech, named entities, dependencies, semantic roles, and entity links. Langforia works as a web service, where the server hosts the language processing components and the client handles the input and result visualization. To annotate a text or a Wikipedia page, the user chooses an NLP pipeline and enters the text in the interface or selects the page URL. Once processed, the results are returned to the client, where the user can select the annotation layers s/he wants to visualize. We designed Langforia with a specific focus on Wikipedia, although it can process any type of text. Wikipedia has become an essential encyclopedic corpus used in many NLP projects. However, processing articles and visualizing the annotations are nontrivial tasks that require dealing with multiple markup variants, encoding issues, and tool incompatibilities across the language versions. This motivated the development of a new architecture. A demonstration of Langforia is available for six languages: English, French, German, Spanish, Russian, and Swedish at http://vilde.cs.lth.se:9000/ as well as a web API: http://vilde.cs.lth.se:9000/api. Langforia is also provided as a standalone library and is compatible with cluster computing.

WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format
Marcus Klang | Pierre Nugues
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Wikipedia has become one of the most popular resources in natural language processing and it is used in a large number of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs specific parsers from language to language, for English, French, Italian, etc. In addition, the intricacies of the different Wikipedia resources (main article text, categories, Wikidata, infoboxes), scattered across the article document or in different files, make it difficult to have a global view of this outstanding resource. In this paper, we describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, or all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extendible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.

Pairing Wikipedia Articles Across Languages
Marcus Klang | Pierre Nugues
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on the same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep into the assignment of entities. In this paper, we describe an optimization technique to automatically match language versions of articles, and hence entities, that is based only on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.

2015

Linking Entities Across Images and Text
Rebecka Weegar | Kalle Åström | Pierre Nugues
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

A Distant Supervision Approach to Semantic Role Labeling
Peter Exner | Marcus Klang | Pierre Nugues
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

Extraction of lethal events from Wikipedia and a semantic repository
Magnus Norrby | Pierre Nugues
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

2014

REFRACTIVE: An Open Source Tool to Extract Knowledge from Syntactic and Semantic Relations
Peter Exner | Pierre Nugues
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The extraction of semantic propositions has proven instrumental in applications like IBM Watson and in Google’s knowledge graph. One of the core components of IBM Watson is the PRISMATIC knowledge base consisting of one billion propositions extracted from the English version of Wikipedia and the New York Times. However, extracting the propositions from the English version of Wikipedia is a time-consuming process. In practice, this task requires multiple machines and a computation distribution involving a good deal of system technicalities. In this paper, we describe Refractive, an open-source tool to extract propositions from a parsed corpus based on the Hadoop variant of MapReduce. While the complete process consists of a parsing part and an extraction part, we focus here on the extraction from the parsed corpus and we hope this tool will help computational linguists speed up the development of applications.

2012

Constructing Large Proposition Databases
Peter Exner | Pierre Nugues
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

With the advent of massive online encyclopedic corpora such as Wikipedia, it has become possible to apply a systematic analysis to a wide range of documents covering a significant part of human knowledge. Using semantic parsers, it has become possible to extract such knowledge in the form of propositions (predicate–argument structures) and build large proposition databases from these documents. This paper describes the creation of multilingual proposition databases using generic semantic dependency parsing. Using Wikipedia, we extracted, processed, clustered, and evaluated a large number of propositions. We built an architecture to provide a complete pipeline dealing with the input of text, extraction of knowledge, storage, and presentation of the resulting propositions.

Visualizing Sentiment Analysis on a User Forum
Rasmus Sundberg | Anders Eriksson | Johan Bini | Pierre Nugues
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Sentiment analysis, or opinion mining, is the process of extracting sentiment from documents or sentences, where the expressed sentiment is typically categorized as positive, negative, or neutral. Many different techniques have been proposed. In this paper, we report the reimplementation of nine algorithms and their evaluation across four corpora to assess the sentiment at the sentence level. We extracted the named entities from each sentence and we associated them with the sentence sentiment. We built a graphical module based on the Qlikview software suite to visualize the sentiments attached to named entities mentioned in Internet forums and follow opinion changes over time.

Using Syntactic Dependencies to Solve Coreferences
Marcus Stamborg | Dennis Medved | Peter Exner | Pierre Nugues
Joint Conference on EMNLP and CoNLL - Shared Task

2011

Exploring Lexicalized Features for Coreference Resolution
Anders Björkelund | Pierre Nugues
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

2010

Automatic Discovery of Feature Sets for Dependency Parsing
Peter Nilsson | Pierre Nugues
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A High-Performance Syntactic and Semantic Dependency Parser
Anders Björkelund | Bernd Bohnet | Love Hafdell | Pierre Nugues
Coling 2010: Demonstrations

2009

Multilingual Semantic Role Labeling
Anders Björkelund | Love Hafdell | Pierre Nugues
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

Predictive Text Entry using Syntax and Semantics
Sebastian Ganslandt | Jakob Jörwall | Pierre Nugues
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

Text Categorization Using Predicate-Argument Structures
Jacob Persson | Richard Johansson | Pierre Nugues
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

The Effect of Syntactic Representation on Semantic Role Labeling
Richard Johansson | Pierre Nugues
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Dependency-based Semantic Role Labeling of PropBank
Richard Johansson | Pierre Nugues
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Comparing Dependency and Constituent Syntax for Frame-semantic Analysis
Richard Johansson | Pierre Nugues
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We address the question of which syntactic representation is best suited for role-semantic analysis of English in the FrameNet paradigm. We compare systems based on dependencies and constituents, and a dependency syntax with a rich set of grammatical functions with one with a smaller set. Our experiments show that dependency-based and constituent-based analyzers give roughly equivalent performance, and that a richer set of functions has a positive influence on argument classification for verbs.

Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank
Richard Johansson | Pierre Nugues
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

Évaluation des stades de développement en français langue étrangère
Jonas Granfeldt | Pierre Nugues
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article describes a system to define and evaluate the stages of development in French as a foreign language. Evaluating such stages amounts to identifying the frequency of certain lexical and grammatical phenomena in the learners' production and how these frequencies change over time. The problems to solve in this approach are threefold: identifying the most revealing attributes, deciding on the separation points between the stages, and evaluating the effectiveness of the attributes and of the classification as a whole. The system addresses these three problems. It consists of a morphosyntactic analyzer, called Direkt Profil, to which we connected a machine-learning module. In this article, we describe the ideas that led to the development of the system and its value. We then present the corpus we used to develop our morphosyntactic analyzer. Finally, we present the results of the classifiers, which are markedly improved compared with previous work (Granfeldt et al., 2006). We also present a parameter selection method to identify the most appropriate grammatical attributes.

Incremental Dependency Parsing Using Online Learning
Richard Johansson | Pierre Nugues
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

LTH: Semantic Structure Extraction using Nonprojective Dependency Trees
Richard Johansson | Pierre Nugues
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

Evaluating Stages of Development in Second Language French: A Machine-Learning Approach
Jonas Granfeldt | Pierre Nugues
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

Extended Constituent-to-Dependency Conversion for English
Richard Johansson | Pierre Nugues
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

A Machine Learning Approach to Extract Temporal Information from Texts in Swedish and Generate Animated 3D Scenes
Anders Berglund | Richard Johansson | Pierre Nugues
11th Conference of the European Chapter of the Association for Computational Linguistics

Automatic Annotation for All Semantic Layers in FrameNet
Richard Johansson | Pierre Nugues
Demonstrations

Extraction of Temporal Information from Texts in Swedish
Anders Berglund | Richard Johansson | Pierre Nugues
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the implementation and evaluation of a generic component to extract temporal information from texts in Swedish. It proceeds in two steps. The first step extracts time expressions and events, and generates a feature vector for each element it identifies. Using the vectors, the second step determines the temporal relations, possibly none, between the extracted events and orders them in time. We used a machine learning approach to find the relations between events. To run the learning algorithm, we collected a corpus of road accident reports from newspaper websites that we manually annotated. It enabled us to train decision trees and to evaluate the performance of the algorithm.

Construction of a FrameNet Labeler for Swedish Text
Richard Johansson | Pierre Nugues
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe the implementation of a FrameNet-based semantic role labeling system for Swedish text. To train the system, we used a semantically annotated corpus that was produced by projection across parallel corpora. As part of the system, we developed two frame element bracketing algorithms that are suitable when no robust constituent parsers are available. Apart from being the first such system for Swedish, this is, as far as we are aware, the first semantic role labeling system for a language for which no role-semantic annotated corpora are available. The estimated accuracy of classification of pre-segmented frame elements is 0.75, and the precision and recall measures for the complete task are 0.67 and 0.47, respectively.

CEFLE and Direkt Profil: a New Computer Learner Corpus in French L2 and a System for Grammatical Profiling
Jonas Granfeldt | Pierre Nugues | Malin Ågren | Jonas Thulin | Emil Persson | Suzanne Schlyter
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The importance of computer learner corpora for research in both second language acquisition and foreign language teaching is rapidly increasing. Computer learner corpora can provide us with data to describe the learner's interlanguage system at different points of its development and they can be used to create pedagogical tools. In this paper, we first present a new computer learner corpus in French. We then describe an analyzer called Direkt Profil that we have developed using this corpus. The system carries out a sentence analysis based on developmental sequences, i.e. local morphosyntactic phenomena linked to a development in the acquisition of French as a foreign language. We present a brief introduction to developmental sequences and some examples in French. In the final section, we introduce and evaluate a method to optimize the definition and detection of learner profiles using machine-learning techniques.

A FrameNet-Based Semantic Role Labeler for Swedish
Richard Johansson | Pierre Nugues
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Investigating Multilingual Dependency Parsing
Richard Johansson | Pierre Nugues
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

Direkt Profil : un système d’évaluation de textes d’élèves de français langue étrangère fondé sur les itinéraires d’acquisition
Jonas Granfeldt | Pierre Nugues | Emil Persson | Lisa Persson | Fabian Kostadinov | Malin Ågren | Suzanne Schlyter
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Direkt Profil is an automatic analyzer of texts written in French as a foreign language. Its goal is to produce an evaluation of the students' language stage in the form of a learner profile. Direkt Profil carries out a sentence analysis based on developmental sequences, i.e. local morphosyntactic phenomena linked to a development in the learning of French. The article presents the corpora we process and, briefly, the developmental sequences. It then describes the annotation we defined, the parsing engine, and the user interface. We conclude with the results obtained so far: on the test corpus, the system reaches a recall of 83% and a precision of 83%.

Direkt Profil: A System for Evaluating Texts of Second Language Learners of French Based on Developmental Sequences
Jonas Granfeldt | Pierre Nugues | Emil Persson | Lisa Persson | Fabian Kostadinov | Malin Ågren | Suzanne Schlyter
Proceedings of the Second Workshop on Building Educational Applications Using NLP