Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction

Pintu Lohar; Sinead Madden; Edmond O’Connor; Maja Popović; Tanya Habruseva

Abstract

Building machine translation systems for a specific domain requires a sufficiently large, good-quality parallel corpus in that domain. However, this is a challenging task because parallel data is lacking in many domains, such as economics, science and technology, and sports. In this work, we build English-to-French translation systems for software product descriptions scraped from the LinkedIn website. Moreover, we develop the first-ever parallel test set of product descriptions. We conduct experiments by building a baseline translation system trained on general-domain data and then domain-adapted systems using sentence-embedding-based corpus filtering and domain-specific sub-corpora extraction. All the systems are tested on our newly developed test set. Our experimental evaluation reveals that the domain-adapted model based on our proposed approaches outperforms the baseline.

Anthology ID: 2022.amta-research.1
Volume: Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month: September
Year: 2022
Address: Orlando, USA
Editors: Kevin Duh, Francisco Guzmán
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 1–13
URL: https://aclanthology.org/2022.amta-research.1/
Cite (ACL): Pintu Lohar, Sinead Madden, Edmond O’Connor, Maja Popovic, and Tanya Habruseva. 2022. Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 1–13, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal): Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction (Lohar et al., AMTA 2022)
PDF: https://aclanthology.org/2022.amta-research.1.pdf
