2016
Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger | Jaakko Väyrynen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages, where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across languages. The aggregation strategies make use of string similarity distances and translation probabilities, and rely on vector space and graph representations. The accuracy of the approach is evaluated against Wikipedia’s redirection and cross-lingual linking tables. The resulting multi-word entity resource contains 64,000 multi-word entities with unique identifiers and their 600,000 multilingual lexical variants. We intend to make this new resource publicly available.
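As a rough illustration of how such aggregation strategies can combine string similarity with translation probabilities, here is a minimal Python sketch. It is not the authors' implementation: the cluster inputs, the trans_prob table, the weight alpha, and the linking threshold are all hypothetical stand-ins for the learned components described in the paper.

```python
# Minimal sketch (not the paper's code): link monolingual clusters of
# acronym expansions across two languages by combining a string
# similarity score with a hypothetical translation probability table.
from difflib import SequenceMatcher
from itertools import product

def string_sim(a: str, b: str) -> float:
    """Normalized surface similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_score(cluster_a, cluster_b, trans_prob, alpha=0.5):
    """Average pairwise score between two clusters of variant strings.

    trans_prob: dict mapping (variant_a, variant_b) -> probability;
    a stand-in for a translation model estimated from parallel data.
    """
    pairs = list(product(cluster_a, cluster_b))
    if not pairs:
        return 0.0
    total = sum(alpha * string_sim(a, b)
                + (1 - alpha) * trans_prob.get((a, b), 0.0)
                for a, b in pairs)
    return total / len(pairs)

def link_clusters(clusters_l1, clusters_l2, trans_prob, threshold=0.6):
    """Greedily link each cluster in language 1 to its best match in language 2."""
    links = []
    for i, ca in enumerate(clusters_l1):
        j, score = max(((j, cluster_score(ca, cb, trans_prob))
                        for j, cb in enumerate(clusters_l2)),
                       key=lambda t: t[1])
        if score >= threshold:
            links.append((i, j, score))
    return links
```

The greedy one-best linking shown here is only one possible aggregation; the paper also explores vector space and graph representations over the same similarity signals.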
2014
DCEP - Digital Corpus of the European Parliament
Najeh Hajlaoui | David Kolovratnik | Jaakko Väyrynen | Ralf Steinberger | Daniel Varga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a new highly multilingual document-aligned parallel corpus called DCEP - Digital Corpus of the European Parliament. It consists of various document types covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered over the course of ten years, this is the largest single release of documents by a European Union institution. DCEP contains most of the content of the European Parliament’s official website. It includes different document types produced between 2001 and 2012, excluding only the documents that already exist in the Europarl corpus, to avoid overlap. We describe the typical acquisition steps of the DCEP corpus: data access, document alignment, sentence splitting, normalisation and tokenisation, and sentence alignment. The sentence-level alignment is still in progress, but first experiments show that DCEP is very useful for NLP applications, in particular for Statistical Machine Translation.
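To illustrate the sentence-alignment step of such a pipeline, here is a minimal length-based aligner in the spirit of Gale and Church (1993). This is only a sketch under simplifying assumptions (1-1, 1-0 and 0-1 moves, character-length costs, an arbitrary skip_cost), not the tooling actually used for DCEP.

```python
# Minimal sketch (assumptions, not the DCEP pipeline): align two lists
# of sentences by dynamic programming, exploiting the fact that mutual
# translations tend to have proportional character lengths.
def align(src, tgt, skip_cost=8.0):
    """1-1 / 1-0 / 0-1 alignment; skip_cost is an arbitrary penalty."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: cost = length difference
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:            # 1-0: leave a source sentence unaligned
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip")
            if j < m:            # 0-1: leave a target sentence unaligned
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip")
    # Trace back the cheapest path and collect the 1-1 sentence pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))
```

Production aligners add 2-1 and 1-2 moves, lexical anchors, and probabilistic length models, but the dynamic program above is the common core.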
2010
Evaluating Machine Translations Using mNCD
Marcus Dobrinkat | Tero Tapiovaara | Jaakko Väyrynen | Kimmo Kettunen
Proceedings of the ACL 2010 Conference Short Papers
Applying Morphological Decompositions to Statistical Machine Translation
Sami Virpioja | Jaakko Väyrynen | André Mansikkaniemi | Mikko Kurimo
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Normalized Compression Distance Based Measures for MetricsMATR 2010
Marcus Dobrinkat | Tero Tapiovaara | Jaakko Väyrynen | Kimmo Kettunen
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Language Identification of Short Text Segments with N-gram Models
Tommi Vatanen | Jaakko J. Väyrynen | Sami Virpioja
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.
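The naive Bayes approach studied in the paper can be sketched in a few lines. The code below uses character bigrams with add-one smoothing rather than the stronger techniques the paper evaluates (such as modified Kneser-Ney interpolation), the vocab_size constant is a rough stand-in for the character inventory, and the two training snippets are placeholders.

```python
# Minimal sketch (not the paper's models): identify the language of a
# short string with a character-bigram naive Bayes classifier.
import math
from collections import Counter

def train(samples):
    """samples: dict language -> training text. Returns per-language counts."""
    models = {}
    for lang, text in samples.items():
        padded = f"\x02{text}\x03"  # boundary markers around the text
        bigrams = Counter(zip(padded, padded[1:]))  # bigram counts
        contexts = Counter(padded[:-1])             # context (history) counts
        models[lang] = (bigrams, contexts)
    return models

def identify(s, models, vocab_size=256):
    """Return the language maximizing the smoothed bigram log-likelihood."""
    padded = f"\x02{s}\x03"
    best_lang, best_ll = None, -math.inf
    for lang, (bigrams, contexts) in models.items():
        ll = sum(math.log((bigrams[(a, b)] + 1) /
                          (contexts[a] + vocab_size))  # add-one smoothing
                 for a, b in zip(padded, padded[1:]))
        if ll > best_ll:
            best_lang, best_ll = lang, ll
    return best_lang

models = train({"en": "the quick brown fox jumps over the lazy dog",
                "fi": "nopea ruskea kettu hyppää laiskan koiran yli"})
print(identify("the dog", models))  # -> 'en'
```

With 281 languages and test strings of 5-21 characters, the choice of smoothing matters far more than in this toy setting, which is exactly the effect the paper measures.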