Teodor - George Marchitan

Also published as: Teodor-George Marchitan, Teodor-george Marchitan

2026

On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America
Liviu P. Dinu | Ana Sabina Uban | Teodor-George Marchitan | Ioan-Bogdan Iordache | Simona Georgescu
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We apply a computational metric of lexical intelligibility based on surface and semantic similarity of related words to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.

2025

Team Unibuc - NLP at SemEval-2025 Task 11: Few-shot text-based emotion detection
Claudiu Creanga | Teodor - George Marchitan | Liviu Dinu
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we obtained an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (1/31 teams) for the Emakhuwa subset.

Team Unibuc - NLP at GenAI Detection Task 1: Qwen it detect machine-generated text?
Claudiu Creanga | Teodor-George Marchitan | Liviu P. Dinu
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

We explored both masked language models and causal models. For Subtask A, our best model ranked first out of 36 teams on F1 Micro (auxiliary score, 0.8333) and second on F1 Macro (main score, 0.8301). Among causal models, our best was a fine-tuned version of Qwen; among masked models, it was a fine-tuned version of XLM-Roberta-Base.

2024

Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu | Ana Sabina Uban | Alina Maria Cristea | Ioan-Bogdan Iordache | Teodor-George Marchitan | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.

Team Unibuc - NLP at SemEval-2024 Task 8: Transformer and Hybrid Deep Learning Based Models for Machine-Generated Text Detection
Teodor-george Marchitan | Claudiu Creanga | Liviu P. Dinu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper describes the approach of the UniBuc - NLP team in tackling the SemEval 2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. We explored transformer-based and hybrid deep learning architectures. For subtask B, our transformer-based model achieved second place out of 77 teams with an accuracy of 86.95%, demonstrating the architecture’s suitability for this task. However, our models overfit in subtask A, which could potentially be fixed with less fine-tuning and a longer maximum sequence length. For subtask C (token-level classification), our hybrid model overfit during training, hindering its ability to detect transitions between human and machine-generated text.

Co-authors

Alina Maria Cristea 1

Laurentiu Zoicas 1
