OCR Processing of Swedish Historical Newspapers Using Deep Hybrid CNN-LSTM Networks
Molly Brandt Skelbye | Dana Dannélls
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Deep CNN–LSTM hybrid neural networks have been shown to improve the accuracy of Optical Character Recognition (OCR) models for different languages. In this paper we examine to what extent these networks improve the OCR accuracy rates on Swedish historical newspapers. By experimenting with the open source OCR engine Calamari, we are able to show that mixed deep CNN–LSTM hybrid models outperform previous models on the task of character recognition of Swedish historical newspapers spanning 1818–1848. We achieved an average character accuracy rate (CAR) of 97.43%, which is a new state-of-the-art result on 19th century Swedish newspaper text. Our data, code and models are released under a CC-BY licence.
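The abstract reports results as character accuracy rate (CAR) but does not spell out the formula. The sketch below uses one common definition, assumed here rather than taken from the paper: CAR = 100 * (N - E) / N, where N is the number of characters in the ground-truth line and E is the Levenshtein (edit) distance between ground truth and OCR output. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy_rate(reference: str, hypothesis: str) -> float:
    """CAR in percent, under the assumed definition 100 * (N - E) / N."""
    if not reference:
        raise ValueError("reference must be non-empty")
    errors = levenshtein(reference, hypothesis)
    return 100.0 * (len(reference) - errors) / len(reference)


if __name__ == "__main__":
    # One substituted character in a seven-character word.
    print(round(character_accuracy_rate("tidning", "tidnlng"), 2))
```

Under this definition a hypothesis with many spurious insertions can push the value below zero; evaluation tools differ in how they handle that case.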

A Novel Machine Learning Based Approach for Post-OCR Error Detection
Shafqat Mumtaz Virk | Dana Dannélls | Azam Sheikh Muhammad
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Post-processing is the most conventional approach for correcting errors that are caused by Optical Character Recognition (OCR) systems. Two steps are usually taken to correct OCR errors: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system for error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection with notable margins. We achieved state-of-the-art F1-scores for eight out of the ten involved European languages. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum for Polish, from 0.82 to 0.84.
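To make the idea of detection from n-gram counts concrete, here is a minimal sketch of how such features might be computed. It is not the authors' system: the use of character n-grams, the boundary padding symbols, the choice of n = 2 and 3, and the minimum-count summary are all assumptions made for illustration, and the classifier that would consume these features is omitted.

```python
from collections import Counter


def char_ngrams(token: str, n: int) -> list[str]:
    """Character n-grams of a token, padded with boundary markers (assumed)."""
    padded = f"^{token}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_counts(tokens, n_values=(2, 3)) -> Counter:
    """Count character n-grams over a reference vocabulary or corpus."""
    counts = Counter()
    for tok in tokens:
        for n in n_values:
            counts.update(char_ngrams(tok, n))
    return counts


def ngram_count_features(token: str, counts: Counter, n_values=(2, 3)) -> list[int]:
    """One feature per n: the count of the token's rarest n-gram (hypothetical choice)."""
    features = []
    for n in n_values:
        values = [counts[g] for g in char_ngrams(token, n)]
        features.append(min(values) if values else 0)
    return features


if __name__ == "__main__":
    counts = build_counts(["tidning", "tidningen"])
    print(ngram_count_features("tidning", counts))  # well-formed token
    print(ngram_count_features("tidnlng", counts))  # token with an OCR-like error
```

A token containing an n-gram never seen in the reference data gets a zero feature, which is the kind of signal a supervised detector could learn from.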

A Data-Driven Semi-Automatic Framenet Development Methodology
Shafqat Mumtaz Virk | Dana Dannélls | Lars Borin | Markus Forsberg
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

FrameNet is a lexical semantic resource based on the linguistic theory of frame semantics. A number of framenet development strategies have been reported previously and all of them involve exploration of corpora and a fair amount of manual work. Despite previous efforts, there does not exist a well-thought-out automatic/semi-automatic methodology for frame construction. In this paper we propose a data-driven methodology for identification and semi-automatic construction of frames. As a proof of concept, we report on our initial attempts to build a wider-scale framenet for the legal domain (LawFN) using the proposed methodology. The constructed frames are stored in a lexical database and together with the annotated example sentences they have been made available through a web interface.

The Swedish Winogender Dataset
Saga Hansson | Konstantinos Mavromatakis | Yvonne Adesam | Gerlof Bouma | Dana Dannélls
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We introduce the SweWinogender test set, a diagnostic dataset to measure gender bias in coreference resolution. It is modelled after the English Winogender benchmark, and is released with reference statistics on the distribution of men and women between occupations and the association between gender and occupation in modern corpus material. The paper discusses the design and creation of the dataset, and presents a small investigation of the supplementary statistics.


Material Philology Meets Digital Onomastic Lexicography: The NordiCon Database of Medieval Nordic Personal Names in Continental Sources
Michelle Waldispühl | Dana Dannélls | Lars Borin
Proceedings of the 12th Language Resources and Evaluation Conference

We present NordiCon, a database containing medieval Nordic personal names attested in Continental sources. The database combines formally interpreted and richly interlinked onomastic data with digitized versions of the medieval manuscripts from which the data originate and information on the tokens’ context. The structure of NordiCon is inspired by other online historical given name dictionaries. It takes up challenges reported on in previous works, such as how to cover material properties of a name token and how to define lemmatization principles, and elaborates on possible solutions. The lemmatization principles for NordiCon are further developed in order to facilitate the connection to other name dictionaries and corpuses, and the integration of the database into Språkbanken Text, an infrastructure containing modern and historical written data.


Polysemy, underspecification, and aspects – Questions of lumping or splitting in the construction of Swedish FrameNet
Karin Friberg Heppin | Dana Dannélls
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

Formalising the Swedish Constructicon in Grammatical Framework
Normunds Gruzitis | Dana Dannélls | Benjamin Lyngfelt | Aarne Ranta
Proceedings of the Grammar Engineering Across Frameworks (GEAF) 2015 Workshop


Using language technology resources and tools to construct Swedish FrameNet
Dana Dannélls | Karin Friberg Heppin | Anna Ehrlemark
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Extracting a bilingual semantic grammar from FrameNet-annotated corpora
Dana Dannélls | Normunds Gruzitis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the creation of an English-Swedish FrameNet-based grammar in Grammatical Framework. The aim of this research is to make existing framenets computationally accessible for multilingual natural language applications via a common semantic grammar API, and to facilitate the porting of such a grammar to other languages. In this paper, we describe the abstract syntax of the semantic grammar while focusing on its automatic extraction possibilities. We have extracted a shared abstract syntax from ~58,500 annotated sentences in Berkeley FrameNet (BFN) and ~3,500 annotated sentences in Swedish FrameNet (SweFN). The abstract syntax defines 769 frame-specific valence patterns that cover 77.8% of the examples in BFN and 74.9% in SweFN belonging to the shared set of 471 frames. As a side result, we provide a unified method for comparing semantic and syntactic valence patterns across framenets.


Multilingual access to cultural heritage content on the Semantic Web
Dana Dannélls | Aarne Ranta | Ramona Enache | Mariana Damova | Maria Mateva
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities


Toward Language Independent Methodology for Generating Artwork Descriptions – Exploring FrameNet Information
Dana Dannélls | Lars Borin
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

On generating coherent multilingual descriptions of museum objects from Semantic Web ontologies
Dana Dannélls
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference


A Framework for Improved Access to Museum Databases in the Semantic Web
Dana Dannélls | Mariana Damova | Ramona Enache | Milen Chechev
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage


Applying Semantic Frame Theory to Automate Natural Language Template Generation From Ontology Statements
Dana Dannélls
Proceedings of the 6th International Natural Language Generation Conference


Recognizing Acronyms and their Definitions in Swedish Medical Texts
Dimitrios Kokkinakis | Dana Dannélls
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper addresses the task of recognizing acronym-definition pairs in Swedish (medical) texts, as well as the compilation of a freely available sample of such manually annotated pairs: material suitable not only for supervised learning experiments, but also as a testbed for evaluating the quality of future acronym-definition recognition systems. There are a number of approaches to the identification described in the literature, particularly within the biomedical domain, but none of those addresses the variation and complexity exhibited in a language other than English. This complexity arises because two languages, i.e. Swedish and English, can be mixed in the same document and/or sentence; because Swedish is a compounding language, which significantly degrades the performance of previous approaches (without adaptations); and, most importantly, because there is a large variation of possible acronym-definition permutations in the analysed corpora, a variation that is usually ignored in previous studies.

Automatic Acronym Recognition
