Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles

Monica Lestari Paramita, Paul Clough, Ahmet Aker, Robert Gaizauskas


Abstract
Wikipedia articles in different languages have been mined to support various tasks, such as Cross-Language Information Retrieval (CLIR) and Statistical Machine Translation (SMT). Articles on the same topic in different languages are often connected by inter-language links, which can be used to identify similar or comparable content. In this work, we investigate how similarity measures that use language-independent and language-dependent features correlate with human judgments of similarity. A collection of 800 Wikipedia article pairs covering 8 language pairs was assembled and judged for similarity by two assessors. We report the development of this corpus and the inter-assessor agreement across the languages. Results show that similarity measured using language-independent features is comparable to that obtained by translating the non-English documents. In both cases the correlation with human judgments is low and also depends on the language pair. The results and the corpus generated from this work also provide insights into the measurement of cross-language similarity.
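As a rough illustration of the comparison described in the abstract, the sketch below correlates automatic similarity scores with human judgments for a handful of article pairs. The abstract does not say which correlation coefficient was used, so Spearman's rank correlation is an assumption here. The function names are hypothetical and the scores are invented toy values, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy values (assumption): one automatic score and one
# human similarity judgment per article pair.
similarity_scores = [0.12, 0.45, 0.33, 0.80, 0.51]
human_judgments = [1, 3, 2, 5, 3]

rho = spearman(similarity_scores, human_judgments)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based coefficient is chosen for the sketch because human similarity judgments are typically ordinal; computing it separately for each language pair would mirror the abstract's observation that the correlation depends on the language pair.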
Anthology ID:
L12-1220
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
790–797
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/426_Paper.pdf
Cite (ACL):
Monica Lestari Paramita, Paul Clough, Ahmet Aker, and Robert Gaizauskas. 2012. Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 790–797, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles (Paramita et al., LREC 2012)