Building a Corpus for Corporate Websites Machine Translation Evaluation. A Step by Step Methodological Approach

Irene Rivera-Trigueros, María-Dolores Olvera-Lobo


Abstract
This paper describes the process carried out to develop a parallel corpus comprising texts extracted from the corporate websites of southern Spanish SMEs from the sanitary sector, which will serve as the basis for MT quality assessment. The stages for compiling the parallel corpus were: (i) selection of websites with content translated into English and Spanish, (ii) downloading of the HTML files of the selected websites, (iii) filtering of files and pairing of English files with their Spanish equivalents, (iv) compilation of individual corpora (EN and ES) for each of the selected websites, (v) merging of the individual corpora into two general corpora, one in English and the other in Spanish, (vi) selection of a representative sample of segments to be used as original texts (ES) and reference translations (EN), and (vii) building of the parallel corpus intended for MT evaluation. The resulting parallel corpus will serve for future machine translation quality assessment. In addition, the monolingual corpora generated during the process could serve as a basis for research focused on linguistic analysis, whether bilingual or monolingual.
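As a rough illustration of stages (iii) and (vi), the sketch below pairs downloaded files and draws a reproducible sample. It is not the authors' implementation. It assumes, purely for illustration, that each site stores its language versions under `es/` and `en/` directory components with otherwise identical paths, and that simple random sampling is an acceptable stand-in for the "representative sample" mentioned in the abstract. The function names `pair_files` and `sample_segments` are hypothetical.

```python
import random
from pathlib import PurePosixPath


def pair_files(paths, src_lang="es", tgt_lang="en"):
    """Pair files whose paths differ only in a language directory component.

    Assumption: language versions live under 'es/' and 'en/' directories.
    Real sites may use other conventions (subdomains, query strings, suffixes).
    """
    by_key = {}
    for p in paths:
        parts = PurePosixPath(p).parts
        for lang in (src_lang, tgt_lang):
            if lang in parts:
                i = parts.index(lang)
                # The path without its language component identifies the page.
                key = parts[:i] + parts[i + 1:]
                by_key.setdefault(key, {})[lang] = p
                break
    # Keep only pages available in both languages; unpaired files are dropped.
    return sorted(
        (d[src_lang], d[tgt_lang])
        for d in by_key.values()
        if src_lang in d and tgt_lang in d
    )


def sample_segments(segment_pairs, k, seed=0):
    """Draw up to k aligned (ES, EN) segment pairs with a fixed seed.

    Assumption: simple random sampling; the paper may define
    representativeness differently.
    """
    rng = random.Random(seed)
    return rng.sample(segment_pairs, min(k, len(segment_pairs)))
```

Calling `pair_files` on a list of downloaded paths returns `(ES, EN)` tuples only for pages present in both languages. Passing aligned segment pairs to `sample_segments` with the same seed yields the same sample on every run, which helps keep an evaluation set stable.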
Anthology ID:
2021.triton-1.11
Volume:
Proceedings of the Translation and Interpreting Technology Online Conference
Month:
July
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Vilelmini Sosoni, Julie Christine Giguère, Elena Murgolo, Elizabeth Deysel
Venue:
TRITON
Publisher:
INCOMA Ltd.
Pages:
93–101
URL:
https://aclanthology.org/2021.triton-1.11
Cite (ACL):
Irene Rivera-Trigueros and María-Dolores Olvera-Lobo. 2021. Building a Corpus for Corporate Websites Machine Translation Evaluation. A Step by Step Methodological Approach. In Proceedings of the Translation and Interpreting Technology Online Conference, pages 93–101, Held Online. INCOMA Ltd.
Cite (Informal):
Building a Corpus for Corporate Websites Machine Translation Evaluation. A Step by Step Methodological Approach (Rivera-Trigueros & Olvera-Lobo, TRITON 2021)
PDF:
https://aclanthology.org/2021.triton-1.11.pdf