2022
Extended Parallel Corpus for Amharic-English Machine Translation
Andargachew Mekonnen Gezmu | Andreas Nürnberger | Tesfaye Bayu Bati
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus, which is intended to support machine translation of Amharic, a low-resource language. We have released the corpus freely for research purposes. Furthermore, we trained baseline statistical and neural machine translation models on the corpus. In the experiments, we also used a large monolingual corpus for the language model of the statistical systems and for back-translation in the neural systems. In the automatic evaluation, the neural machine translation models outperform the statistical machine translation models by approximately six to seven Bilingual Evaluation Understudy (BLEU) points. Among the neural models, the subword models outperform the word-based models by three to four BLEU points. Two other automatic evaluation metrics, Translation Edit Rate on Character Level and Better Evaluation as Ranking, reflect corresponding differences among the trained models.
2018
Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu | Binyam Ephrem Seyoum | Michael Gasser | Andreas Nürnberger
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. The texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Since it is partly a web corpus, we applied automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
Portable Spelling Corrector for a Less-Resourced Language: Amharic
Andargachew Mekonnen Gezmu | Andreas Nürnberger | Binyam Ephrem Seyoum
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2008
Arabic/English word translation disambiguation using parallel corpora and matching schemes
Farag Ahmed | Andreas Nürnberger
Proceedings of the 12th Annual Conference of the European Association for Machine Translation
A Comparative Study on Language Identification Methods
Lena Grothe | Ernesto William De Luca | Andreas Nürnberger
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present two experiments conducted to compare different language identification algorithms. Short-word-based, frequent-word-based, and n-gram-based approaches are considered and combined with the Ad-Hoc Ranking classification method. The language identification process can be subdivided into two main steps: first, a document model is generated for the document and a language model for each language; second, the language of the document is determined on the basis of the language models and is added to the document as additional information. We present our evaluation results and discuss the importance of a dynamic value for the out-of-place measure.
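The two-step process described in this abstract can be illustrated with a minimal Python sketch of rank-based identification using character n-grams and an out-of-place measure. This is not the authors' implementation: the function names (`ngram_profile`, `out_of_place`, `identify`), the profile size, and the choice of setting the penalty for unseen n-grams to the length of the language profile are all assumptions made for illustration. The last of these is only one possible reading of a "dynamic value".

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Build a ranked list of the top_k most frequent character n-grams.

    Hypothetical helper: n_max and top_k are illustrative defaults,
    not values taken from the paper.
    """
    counts = Counter()
    # Normalise whitespace and pad so word boundaries become n-grams too.
    padded = f" {' '.join(text.lower().split())} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile, max_penalty=None):
    """Sum of rank differences between a document and a language profile.

    N-grams missing from the language profile receive max_penalty.
    Assumption: when max_penalty is None it is set to the length of the
    language profile, so it adapts to the profile size.
    """
    if max_penalty is None:
        max_penalty = len(lang_profile)
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            total += abs(rank - lang_rank[gram])
        else:
            total += max_penalty
    return total


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the document model."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )
```

A call such as `identify(document_text, {"en": ngram_profile(english_sample), "de": ngram_profile(german_sample)})` returns the key of the closest language model; the sample texts here are placeholders that a user would supply.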
2006
Rebuilding Lexical Resources for Information Retrieval using Sense Folder Detection and Merging Methods
Ernesto William De Luca | Andreas Nürnberger
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper we discuss the problem of sense disambiguation using lexical resources such as ontologies or thesauri, with a focus on the application of sense detection and merging methods in information retrieval systems. For an information retrieval task, it is important to detect the meaning of a query word in order to retrieve the relevant documents. To recognize the meaning of a search word, lexical resources such as WordNet can be used for word sense disambiguation. However, an analysis of the WordNet structure shows that this ontology has several problems. The overly fine-grained distinction between word senses, for example, is ill-suited to information retrieval. We describe related problems and present four implemented online methods to merge SynSets based on relations such as hypernyms and hyponyms, and on further context information such as glosses and domain. We then present a first evaluation of our approach, compare the different merging methods, and briefly discuss future work.