Enikő Héja

2025

In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models across a diverse set of linguistic and reasoning skills, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, grammaticality, and domain-specific knowledge through tasks like TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including those developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on the capabilities of current LLMs in processing the Hungarian language. Through our analysis, we aim to both showcase the current state of Hungarian linguistic processing in LLMs and provide a foundational resource for future advancements in the field.

2024

pdf bib abs

HuLU: Hungarian Language Understanding Benchmark Kit
Noémi Ligeti-Nagy | Gergő Ferenczi | Enikő Héja | László János Laki | Noémi Vadász | Zijian Győző Yang | Tamás Váradi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The paper introduces the Hungarian Language Understanding (HuLU) benchmark, a comprehensive assessment framework designed to evaluate the performance of neural language models on Hungarian language tasks. Inspired by the renowned GLUE and SuperGLUE benchmarks, HuLU aims to address the challenges specific to Hungarian language processing. The benchmark consists of various datasets, each representing different linguistic phenomena and task complexities. Moreover, the paper presents a web service developed for HuLU, offering a user-friendly interface for model evaluation. This platform not only ensures consistent assessment but also fosters transparency by maintaining a leaderboard showcasing model performances. Preliminary evaluations of various LMMs on HuLU datasets indicate that while Hungarian models show promise, there’s room for improvement to match the proficiency of English-centric models in their native language.

2022

pdf bib abs

A Clique-based Graphical Approach to Detect Interpretable Adjectival Senses in Hungarian
Enikő Héja | Noémi Ligeti-Nagy
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

The present paper introduces an ongoing research which aims to detect interpretable adjectival senses from monolingual corpora applying an unsupervised WSI approach. According to our expectations the findings of our investigation are going to contribute to the work of lexicographers, linguists and also facilitate the creation of benchmarks with semantic information for the NLP community. For doing so, we set up four criteria to distinguish between senses. We experiment with a graphical approach to model our criteria and then perform a detailed, linguistically motivated manual evaluation of the results.

2012

pdf bib

Automatically Generated Customizable Online Dictionaries
Enikő Héja | Dávid Takács
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs

Automatically Generated Online Dictionaries
Enikő Héja | Dávid Takács
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of our software presentation is to demonstrate that corpus-driven bilingual dictionaries generated fully by automatic means are suitable for human use. Need for such dictionaries shows up specifically in the case of lesser used languages where due to the low demand it does not pay off for publishers to invest into the production of dictionaries. Previous experiments have proven that bilingual lexicons can be created by applying word alignment on parallel corpora. Such an approach, especially the corpus-driven nature of it, yields several advantages over more traditional approaches. Most importantly, automatically attained translation probabilities are able to guarantee that the most frequently used translations come first within an entry. However, the proposed technique have to face some difficulties, as well. In particular, the scarce availability of parallel texts for medium density languages imposes limitations on the size of the resulting dictionary. Our objective is to design and implement a dictionary building workflow and a query system that is apt to exploit the additional benefits of the method and overcome the disadvantages of it.

2010

pdf bib abs

The Role of Parallel Corpora in Bilingual Lexicography
Enikő Héja
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes an approach based on word alignment on parallel corpora, which aims at facilitating the lexicographic work of dictionary building. Although this method has been widely used in the MT community for at least 16 years, as far as we know, it has not been applied to facilitate the creation of bilingual dictionaries for human use. The proposed corpus-driven technique, in particular the exploitation of parallel corpora, proved to be helpful in the creation of such dictionaries for several reasons. Most importantly, a parallel corpus of appropriate size guarantees that the most relevant translations are included in the dictionary. Moreover, based on the translational probabilities it is possible to rank translation candidates, which ensures that the most frequently used translation variants go first within an entry. A further advantage is that all the relevant example sentences from the parallel corpora are easily accessible, thus facilitating the selection of the most appropriate translations from possible translation candidates. Due to these properties the method is particularly apt to enable the production of active or encoding dictionaries.

Enikő Héja

2025

2024

2022

2012

2010

2007

Co-authors

Venues