Mohammed Hossein Jafari Harandi
2024
EPOQUE: An English-Persian Quality Estimation Dataset
Mohammed Hossein Jafari Harandi
|
Fatemeh Azadi
|
Mohammad Javad Dousti
|
Heshaam Faili
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Translation quality estimation (QE) is an important component in real-world machine translation applications. Unfortunately, human labeled QE datasets, which play an important role in developing and assessing QE models, are only available for limited language pairs. In this paper, we present the first English-Persian QE dataset, called EPOQUE, which has manually annotated direct assessment labels. EPOQUE contains 1000 sentences translated from English to Persian and annotated by three human annotators. It is publicly available, and thus can be used as a zero-shot test set, or for other scenarios in future work. We also evaluate and report the performance of two state-of-the-art QE models, i.e., Transquest and CometKiwi, as baselines on our dataset. Furthermore, our experiments show that using a small subset of the proposed dataset containing 300 sentences to fine-tune Transquest, can improve its performance by more that 8% in terms of the Pearson correlation with a held-out test set.