Beáta Megyesi - ACL Anthology

Beáta Megyesi

Also published as: Beata Megyesi, Beáta Bandmann Megyesi, Beáta B. Megyesi

2025

Prompting the Past: Exploring Zero-Shot Learning for Named Entity Recognition in Historical Texts Using Prompt-Answering LLMs
Crina Tudor | Beata Megyesi | Robert Östling
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

This paper investigates the application of prompt-answering Large Language Models (LLMs) for the task of Named Entity Recognition (NER) in historical texts. Historical NER presents unique challenges due to language change through time, spelling variation, limited availability of digitized data (and, in particular, labeled data), and errors introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) processes. Leveraging the zero-shot capabilities of prompt-answering LLMs, we address these challenges by prompting the model to extract entities such as persons, locations, organizations, and dates from historical documents. We then conduct an extensive error analysis of the model output in order to identify and address potential weaknesses in the entity recognition process. The results show that, while such models display some ability to extract named entities, their overall performance is lackluster. Our analysis reveals that model performance is significantly affected by hallucinations in the model output, as well as by challenges imposed by the evaluation of NER output.

2023

Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Nikolai Ilinykh | Felix Morger | Dana Dannélls | Simon Dobnik | Beáta Megyesi | Joakim Nivre
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

2022

Identifying Cleartext in Historical Ciphers
Maria-Elena Gambardella | Beata Megyesi | Eva Pettersson
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

In historical encrypted sources we can find encrypted text sequences, also called ciphertext, as well as non-encrypted cleartexts written in a known language. While most cryptanalysis focuses on the decryption of ciphertext, cleartext is often overlooked although it can give us important clues about the historical interpretation and contextualisation of the manuscript. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in historical ciphers and to what extent we are able to identify its language. The problem is challenging as cleartext sequences in ciphers are often short, up to a few words, and in different languages due to historical code-switching. To identify the sequences and the language(s), we chose a rule-based approach and ran 7 different models using historical language models on various ciphertexts.

2020

Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays
Elena Volodina | Yousuf Ali Mohammed | Sandra Derbring | Arild Matsson | Beata Megyesi
Proceedings of the 28th International Conference on Computational Linguistics

This article reports on an ongoing project aiming at automatization of pseudonymization of learner essays. The process includes three steps: identification of personal information in an unstructured text, labeling for a category, and pseudonymization. We experiment with rule-based methods for detection of 15 categories out of the suggested 19 (Megyesi et al., 2018) that we deem important and/or doable with automatic approaches. For the detection and labeling steps, we use resources covering personal names, geographic names, company and university names and others. For the pseudonymization step, we replace the item using another item of the same type from the above-mentioned resources. Evaluation of the detection and labeling steps is made on a set of manually anonymized essays. The results are promising and show that 89% of the personal information can be successfully identified in learner data, and annotated correctly with an inter-annotator agreement of 86% measured as Fleiss kappa and Krippendorff’s alpha.

2019

Proceedings of the Workshop on NLP and Pseudonymisation
Lars Ahrenberg | Beata Megyesi
Proceedings of the Workshop on NLP and Pseudonymisation

Matching Keys and Encrypted Manuscripts
Eva Pettersson | Beata Megyesi
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Historical cryptology is the study of historical encrypted messages aiming at their decryption by analyzing the mathematical, linguistic and other coding patterns and their historical context. In libraries and archives we can find quite a lot of ciphers, as well as keys describing the method used to transform the plaintext message into a ciphertext. In this paper, we present work on automatically mapping keys to ciphers to reconstruct the original plaintext message, and use language models generated from historical texts to guess the underlying plaintext language.

2018

Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish
Beáta Megyesi | Lena Granstedt | Sofia Johansson | Julia Prentice | Dan Rosén | Carl-Johan Schenström | Gunlög Sundberg | Mats Wirén | Elena Volodina
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

2017

Annotating errors in student texts: First experiences and experiments
Sara Stymne | Eva Pettersson | Beáta Megyesi | Anne Palmér
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

SWEGRAM – A Web-Based Tool for Automatic Annotation and Analysis of Swedish Texts
Jesper Näsman | Beáta Megyesi | Anne Palmér
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis
Beáta Megyesi | Jesper Näsman | Anne Palmér
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Uppsala Corpus of Student Writings consists of Swedish texts produced as part of a national test of students ranging in age from nine (in year three of primary school) to nineteen (the last year of upper secondary school) who are studying either Swedish or Swedish as a second language. National tests have been collected since 1996. The corpus currently consists of 2,500 texts containing over 1.5 million tokens. Parts of the texts have been annotated on several linguistic levels using existing state-of-the-art natural language processing tools. In order to make the corpus easy to interpret for scholars in the humanities, we chose the CoNLL format instead of an XML-based representation. Since spelling and grammatical errors are common in student writings, the texts are automatically corrected while keeping the original tokens in the corpus. Each token is annotated with part-of-speech and morphological features as well as syntactic structure. The main purpose of the corpus is to facilitate the systematic and quantitative empirical study of the writings of various student groups based on gender, geographic area, age, grade awarded or a combination of these, synchronically or diachronically. The intention is for this to be a monitor corpus, currently under development.

2015

Ranking Relevant Verb Phrases Extracted from Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)
Beáta Megyesi
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

EACL - Expansion of Abbreviations in CLinical text
Lisa Tengstrand | Beáta Megyesi | Aron Henriksson | Martin Duneld | Maria Kvist
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

A Persian Treebank with Stanford Typed Dependencies
Mojgan Seraji | Carina Jahani | Beáta Megyesi | Joakim Nivre
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the Uppsala Persian Dependency Treebank (UPDT) with a syntactic annotation scheme based on Stanford Typed Dependencies. The treebank consists of 6,000 sentences and 151,671 tokens with an average sentence length of 25 words. The data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source Uppsala Persian Corpus (UPC). The syntactic annotation scheme is extended for Persian to include all syntactic relations that could not be covered by the primary scheme developed for English. In addition, we present open source tools for automatic analysis of Persian containing a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser. The treebank and the parser have been developed simultaneously in a bootstrapping procedure. The result of a parsing experiment shows an overall labeled attachment score of 82.05% and an unlabeled attachment score of 85.29%. The treebank is freely available as an open source resource.

A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Parsing the Past - Identification of Verb Constructions in Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Dependency Parsers for Persian
Mojgan Seraji | Beata Megyesi | Joakim Nivre
Proceedings of the 10th Workshop on Asian Language Resources

A Basic Language Resource Kit for Persian
Mojgan Seraji | Beáta Megyesi | Joakim Nivre
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Persian, with about 100,000,000 speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for the linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), which is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is derived from UPEC with an annotation scheme based on Stanford Typed Dependencies and is planned to consist of 10,000 sentences, of which 215 have already been annotated. Keywords: BLARK for Persian, PoS tagged corpus, Persian treebank

2011

The Copiale Cipher
Kevin Knight | Beáta Megyesi | Christiane Schaefer
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

The English-Swedish-Turkish Parallel Treebank
Beáta Megyesi | Bengt Dahlqvist | Éva Á. Csató | Joakim Nivre
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment. In addition, we use basic language resource kits for the linguistic analysis of the languages involved. The annotation is carried out on various layers, from morphological and part-of-speech analysis to dependency structures. The tools used for linguistic annotation, e.g., the HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language. The parallel treebank is used in teaching and linguistic research to study the relationship between the structurally different languages. In order to study the treebank, several tools have been developed for the visualization of the annotation and alignment, allowing search for linguistic patterns.

2009

The Open Source Tagger HunPoS for Swedish
Beáta Megyesi
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Language Resources and Tools for Swedish: A Survey
Kjell Elenius | Eva Forsbom | Beáta Megyesi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.

Swedish-Turkish Parallel Treebank
Beáta Megyesi | Bengt Dahlqvist | Eva Pettersson | Joakim Nivre
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we describe our work on building a parallel treebank for a less studied and typologically dissimilar language pair, namely Swedish and Turkish. The treebank is a balanced syntactically annotated corpus containing both fiction and technical documents. In total, it consists of approximately 160,000 tokens in Swedish and 145,000 in Turkish. The texts are linguistically annotated using different layers, from part-of-speech tags and morphological features to dependency annotation. Each layer is automatically processed by using basic language resources for the involved languages. The sentences and words are aligned, and partly manually corrected. We create the treebank by reusing and adjusting existing tools for the automatic annotation, alignment, and their correction and visualization. The treebank was developed within the project supporting research environment for minor languages, aiming to create representative language resources for language pairs dissimilar in language structure. Therefore, effort is put into developing a general method for formatting and annotation procedure, as well as using tools that can be applied to other language pairs easily.

2007

The Swedish-Turkish Parallel Corpus and Tools for its Creation
Beata Megyesi | Bengt Dahlqvist
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

Single Malt or Blended? A Study in Multilingual Parser Optimization
Johan Hall | Jens Nilsson | Joakim Nivre | Gülşen Eryiǧit | Beáta Megyesi | Mattias Nilsson | Markus Saers
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Building a Swedish-Turkish Parallel Corpus
Beáta Bandmann Megyesi | Anna Sågvall Hein | Éva Csató Johanson
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present a Swedish-Turkish Parallel Corpus intended for use in linguistic research, teaching, and applications in natural language processing, primarily machine translation. The corpus, which is under development, is built by using a Basic LAnguage Resource Kit (BLARK) for the two languages, which is then used in the automatic alignment phase to improve alignment accuracy. The corpus is balanced with respect to source and target language and is automatically processed using the Uplug toolkit.

A Study on Automatically Extracted Keywords in Text Categorization
Anette Hulth | Beáta B. Megyesi
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2001

Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish
Beáta Megyesi
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

Data-Driven Methods for PoS Tagging and Chunking of Swedish
Beáta Megyesi
Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001)

2000

Towards a Finite-State Parser for Swedish
Beáta Megyesi | Sara Rydin
Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999)

1999

Improving Brill’s POS Tagger for an Agglutinative Language
Beata Megyesi
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora
