Daria Sinitsyna
2024
Comparative Evaluation of Large Language Models for Linguistic Quality Assessment in Machine Translation
Daria Sinitsyna | Konstantin Savenkov
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations)
Building on our GPT-4 LQA research in MT, this study identifies the top-performing LLMs for a linguistic quality assessment (LQA) pipeline with up to three models. LLMs such as GPT-4, GPT-4o, GPT-4 Turbo, Google Vertex, Anthropic’s Claude 3, and Llama-3 are prompted using the MQM (Multidimensional Quality Metrics) error typology. These models generate segment-wise outputs describing translation errors, scored by severity and DQF-MQM penalties. The study evaluates four language pairs (English-Spanish, English-Chinese, English-German, and English-Portuguese) using datasets from our 2024 State of MT Report spanning eight domains. LLM outputs are correlated with human judgments, and the models are ranked by their alignment with human assessments of penalty score, issue presence, issue type, and issue severity. The resulting pipeline combines up to three models, weighted by output quality, highlighting LLMs’ potential to enhance MT review processes and improve translation quality.
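To make the scoring step concrete, here is a minimal sketch of a DQF-MQM-style penalty computation for one translated segment. The severity weights (minor = 1, major = 5, critical = 10) follow commonly cited DQF-MQM defaults; the error-record format, the per-word normalization, and the final score formula are illustrative assumptions, not the authors’ exact pipeline.

```python
# Sketch of DQF-MQM-style segment scoring. Weights follow common DQF-MQM
# defaults; the data shapes and normalization are assumptions for illustration.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 10}

@dataclass
class Error:
    category: str   # MQM error type, e.g. "accuracy/mistranslation"
    severity: str   # one of SEVERITY_WEIGHTS

def segment_penalty(errors: list[Error], word_count: int) -> float:
    """Per-word penalty for one segment: summed severity weights / length."""
    total = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return total / max(word_count, 1)

def quality_score(errors: list[Error], word_count: int) -> float:
    """DQF-MQM-style quality score: 1 minus the normalized penalty."""
    return 1.0 - segment_penalty(errors, word_count)

errors = [
    Error("accuracy/mistranslation", "major"),
    Error("fluency/grammar", "minor"),
]
print(quality_score(errors, word_count=24))  # 1 - 6/24 = 0.75
```

In a multi-model pipeline of this kind, each model’s segment scores could then be combined with per-model weights reflecting how well its outputs align with human judgments.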
2020
Language Models for Cloze Task Answer Generation in Russian
Anastasia Nikiforova | Sergey Pletenev | Daria Sinitsyna | Semen Sorokin | Anastasia Lopukhina | Nick Howell
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources
Linguistic predictability is the degree of confidence with which a language unit (word, part of speech, etc.) can be anticipated as the next in a sequence. Experiments have shown that a correct prediction simplifies the perception of a language unit and its integration into context, while an incorrect prediction slows language processing down. Currently, to obtain a measure of a language unit’s predictability, a neurolinguistic experiment known as a cloze task has to be conducted with a large number of participants. Cloze tasks are resource-consuming and are criticized by some researchers as an insufficiently valid measure of predictability. In this paper, we compare different language models that attempt to simulate human respondents’ performance on the cloze task. Using a language model to simulate cloze tasks would require significantly less time and would make it far easier to conduct studies related to linguistic predictability.
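As a rough illustration of the simulation idea, a causal language model’s next-token distribution can stand in for the distribution of human cloze responses. The sketch below uses GPT-2 and a top-k read-out purely for illustration; the paper works with Russian-language models and compares several model families, so the model choice and read-out here are assumptions.

```python
# Minimal sketch: read a causal LM's next-token distribution as a proxy for
# human cloze responses. GPT-2 and top-k are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def cloze_candidates(context: str, k: int = 5) -> list[tuple[str, float]]:
    """Return the k most probable next tokens with their probabilities,
    analogous to the most frequent human cloze-task answers."""
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode([int(i)]), p.item())
            for i, p in zip(top.indices, top.values)]

# Example: candidate continuations for a sentence prefix.
for token, prob in cloze_candidates("The cat sat on the"):
    print(f"{prob:.3f}  {token!r}")
```

A model evaluated this way can be compared against human cloze norms, e.g. by correlating the model’s probability for the majority human answer with its observed cloze probability.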