A light way to collect comparable corpora from the Web

Ahmet Aker, Evangelos Kanoulas, Robert Gaizauskas


Abstract
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, for under-resourced languages, parallel corpora are not readily available. To overcome this problem, previous work has recognized the potential of using comparable corpora as training data. The process of obtaining such data usually involves three steps: (1) downloading a separate list of documents for each language, (2) matching the documents between the two languages, usually by comparing the document contents, and (3) extracting useful data for SMT from the matched document pairs. This process requires a large amount of time and resources, since a huge volume of documents needs to be downloaded to increase the chances of finding good document pairs. In this work we aim to reduce the time and resources spent on steps 1 and 2. Instead of obtaining full documents, we first obtain just the titles along with some metadata, such as the time and date of publication. Titles can be obtained through Web search and RSS news feed collections, so the full documents do not need to be downloaded. We show experimentally that comparing titles can approximate comparing documents by their full contents.
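
The abstract does not specify how titles are compared, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a tiny hypothetical bilingual lexicon, a Jaccard word-overlap score, a one-day publication window, and a 0.3 threshold; the names `Title`, `match_titles`, `max_gap`, and `threshold` are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Title:
    text: str
    published: datetime


def tokens(text):
    """Lowercase the title and split it into a set of punctuation-free words."""
    return {w.lower().strip(".,:;!?\"'") for w in text.split()} - {""}


def translate(words, lexicon):
    """Map each source word through the (assumed) lexicon; keep unknown words as-is."""
    return {lexicon.get(w, w) for w in words}


def jaccard(a, b):
    """Word-set overlap in [0, 1]; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_titles(src, tgt, lexicon, max_gap=timedelta(days=1), threshold=0.3):
    """Pair each source title with its best-scoring target title.

    Only targets published within max_gap of the source are considered,
    and a pair is kept only if its score reaches the threshold.
    """
    pairs = []
    for s in src:
        s_words = translate(tokens(s.text), lexicon)
        best, best_score = None, threshold
        for t in tgt:
            if abs(s.published - t.published) > max_gap:
                continue  # publication dates too far apart
            score = jaccard(s_words, tokens(t.text))
            if score >= best_score:
                best, best_score = t, score
        if best is not None:
            pairs.append((s, best, best_score))
    return pairs


# Toy data, invented for illustration only.
lexicon = {"erdbeben": "earthquake", "erschüttert": "shakes", "türkei": "turkey"}
german = [Title("Erdbeben erschüttert Türkei", datetime(2012, 5, 21, 9, 0))]
english = [
    Title("Earthquake shakes Turkey", datetime(2012, 5, 21, 11, 30)),
    Title("Football final tonight", datetime(2012, 5, 21, 12, 0)),
    Title("Earthquake shakes Turkey", datetime(2012, 6, 2, 8, 0)),
]
pairs = match_titles(german, english, lexicon)
```

In this toy run the German title is paired only with the English title published the same day; the identically worded title from twelve days later is filtered out by the date window before any text comparison happens. Only titles and timestamps are touched, which is the point of the lightweight approach: no full document needs to be downloaded to propose candidate pairs.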
Anthology ID:
L12-1359
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
15–20
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/626_Paper.pdf
Cite (ACL):
Ahmet Aker, Evangelos Kanoulas, and Robert Gaizauskas. 2012. A light way to collect comparable corpora from the Web. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 15–20, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A light way to collect comparable corpora from the Web (Aker et al., LREC 2012)