SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet

Gema Ramírez-Sánchez, Sergio Ortiz Rojas, Alicia Núñez Alcover, Tudor Mateiu, Mikel Forcada, Pedro Orzas, Almudena Carrillo, Giuseppe Nolasco, Noelia Listón


Abstract
SmartBiC, an 18-month innovation project funded by the Spanish Government, aims to improve the full process of collecting, filtering and selecting in-domain parallel content for machine translation and language model tuning in industrial settings. Building on the state-of-the-art technology of Bitextor, a free/open-source harvester of parallel web corpora, SmartBiC develops a web-based application around it that includes novel components such as a language- and domain-focused crawler and a domain-specific corpora selector. SmartBiC also addresses specific industrial use cases for individual components of the Bitextor pipeline, such as parallel data cleaning. Relevant improvements to the current Bitextor pipeline will be publicly released.
Anthology ID:
2024.eamt-2.5
Volume:
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
Month:
June
Year:
2024
Address:
Sheffield, UK
Editors:
Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Mikel Forcada, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation (EAMT)
Pages:
8–9
URL:
https://aclanthology.org/2024.eamt-2.5
Cite (ACL):
Gema Ramírez-Sánchez, Sergio Ortiz Rojas, Alicia Núñez Alcover, Tudor Mateiu, Mikel Forcada, Pedro Orzas, Almudena Carrillo, Giuseppe Nolasco, and Noelia Listón. 2024. SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2), pages 8–9, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal):
SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet (Ramírez-Sánchez et al., EAMT 2024)
PDF:
https://aclanthology.org/2024.eamt-2.5.pdf