2024
pdf
bib
abs
Compilation of a Synthetic Judeo-French Corpus
Iglika Nikolova-Stoupak
|
Gaél Lejeune
|
Eva Schaeffer-Lacroix
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
This is a short paper describing the process of derivation of synthetic Judeo-French text. Judeo-French is one of a number of rare languages used in speaking and writing by Jewish communities as confined to a particular temporal and geographical frame (in this case, 11th- to 14th-century France). The number of resources in the language is very limited and its involvement in the contemporary domain of Natural Language Processing (NLP) is practically non-existent. This work outlines the compilation of a synthetic Judeo-French corpus. For the purpose, a pipeline of transformations is applied to Old French text belonging to the same general time period, leading to the derivation of text that is as reliable as possible in terms of phonological, morphological and lexical characteristics as witnessed in Judeo-French. Ultimately, the goal is for this synthetic corpus to be used in standard NLP tasks, such as Neural Machine Translation (NMT), as an instance of data augmentation.
pdf
bib
Generating Contexts for ESP Vocabulary Exercises with LLMs
Iglika Nikolova-Stoupak
|
Serge Bibauw
|
Amandine Dumont
|
Françoise Stas
|
Patrick Watrin
|
Thomas François
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning
pdf
bib
abs
LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison
Iglika Nikolova-Stoupak
|
Serge Bibauw
|
Amandine Dumont
|
Françoise Stas
|
Patrick Watrin
|
Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
This project evaluates the potential of LLM and dynamic corpora to generate contexts ai- med at the practice and acquisition of specialised English vocabulary. We compared reference contexts—handpicked by expert teachers—for a specialised vocabulary list to contexts generated by three recent large language models (LLM) of different sizes (Mistral-7B-Instruct, Vicuna-13B, and Gemini 1.0 Pro) and to contexts extracted from articles web-crawled from specialised websites. The comparison uses a representative set of length-based, morphosyntactic, semantic, and discourse- related textual characteristics. We conclude that the LLM-based corpora can be combined effectively with a web-crawled one to form an academic corpus characterised by appropriate complexity and textual variety.
pdf
bib
abs
Contemporary LLMs and Literary Abridgement: An Analytical Inquiry
Iglika Nikolova-Stoupak
|
Gaël Lejeune
|
Eva Schaeffer-Lacroix
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
Within the framework of this study, several contemporary Large Language Models (ChatGPT, Gemini Pro, Mistral-Instruct and BgGPT) are evaluated in relation to their ability to generate abridged versions of literary texts. The analysis is based on ’The Ugly Duckling’ by H. C. Andersen as translated into English, French and Bulgarian. The different scenarios of abridgement experimented with include zero-shot, one-shot, division into chunks and crosslingual (including chain-of-thought) abridgement. The resulting texts are evaluated both automatically and via human evaluation. The automatic analysis includes ROUGE and BERTScore as well as the ratios of a selection of readability-related textual features (e.g. number of words, type-to-token ratio) as pertaining to the original versus automatically abridged texts. Professionally composed abridged versions are regarded as gold standard. Following the automatic analysis, six selected best candidate texts per language are then evaluated by volunteers with university education in terms of textual characteristics of a more qualitative nature, such as coherence, consistency and aesthetic appeal.
pdf
bib
abs
Extended Context at the Introduction of Complex Vocabulary in Abridged Literary Texts
Iglika Nikolova-Stoupak
|
Eva Schaeffer-Lacroix
|
Gaël Lejeune
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
Psycholinguistics speaks of a fine-tuning process used by parents as they address children, in which complex vocabulary is introduced with additional context (Leung et al., 2021). This somewhat counterintuitive lengthening of text in order to aid one’s interlocutor in the process of language acquisition also comes in accord with Harris (1988)’s notion that for every complex sentence, there is an equivalent longer (non-contracted) yet simpler one that contains the same amount of information. Within the proposed work, a corpus of eight renowned literary works (e.g. Alice’s Adventures in Wonderland, The Adventures of Tom Sawyer, Les Misérables) in four distinct languages (English, French, Russian and Spanish) is gathered: both the original (or translated) versions and up to four abridged versions for various audiences (e.g. children of a defined age or foreign language learners of a defined level) are present. The contexts of the first appearance of complex words (as determined based on word frequency) in pairs of original and abridged works are compared, and the cases in which the abridged texts offer longer context are investigated. The discovered transformations are consequently classified into three separate categories: addition of vocabulary items from the same lexical field as the complex word, simplification of grammar and insertion of a definition. Context extensions are then statistically analysed as associated with different languages and reader audiences.
2022
pdf
bib
abs
Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair
Iglika Nikolova-Stoupak
|
Shuichiro Shimizu
|
Chenhui Chu
|
Sadao Kurohashi
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora in rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus utilised to train machine translation models in the study is CCMatrix, provided by OPUS. Firstly, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three discrete ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) utilised in a thematic WMT shared task. The performance of the issuing models is evaluated and compared. The classifier-based model does not reach high performance as compared with its margin-based counterpart, opening a discussion of ways for further improvement. Still, BLEU scores surpass those of Acarcicek et al.’s (2020) paper by over 15 points.
2020
pdf
bib
abs
A Natural Language for Bulgarian Primary and Secondary Education
Iglika Nikolova-Stoupak
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
This paper examines the qualities and applicability of a provisional programming language, especially designed for use by beginner-level students in Bulgarian primary and secondary schools. The necessity for such a language is investigated. Then, relevant features are defined, as inspired by various programming languages (notably, languages used in education and characterised with non- English syntax) and by general trends related to the achievement of natural language in software development. A survey is conducted to test young students’ interaction with the language, and the latter’s advantages and limitations are listed and discussed.