Tan Ngoc Le
Also published as: Tan Ngoc Le
2020
Low-Resource NMT: an Empirical Study on the Effect of Rich Morphological Word Segmentation on Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Revitalization of Indigenous Languages through Pre-processing and Neural Machine Translation: The case of Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 28th International Conference on Computational Linguistics
Indigenous languages have been very challenging for NLP tasks and applications, for multiple reasons. In terms of linguistic typology, these languages are polysynthetic and highly inflected, with rich morphophonemics and variable, dialect-dependent spellings, which has affected studies on NLP tasks in recent years. Moreover, Indigenous languages are considered low-resource and/or endangered, which poses a great challenge for research in Artificial Intelligence and its subfields, such as NLP and machine learning. In this paper, we propose a study of the Inuktitut language through pre-processing and neural machine translation, with the aim of revitalizing the language, which belongs to the Inuit family of polysynthetic languages spoken in Northern Canada. We focus on (1) the pre-processing phase and (2) applications to specific NLP tasks, such as morphological analysis and neural machine translation, both for Indigenous languages of Canada. Our evaluations in the context of low-resource Inuktitut-English neural machine translation showed significant improvements of the proposed approach over the state of the art.
2019
Augmenting Named Entity Recognition with Commonsense Knowledge
Gaith Dekhili | Tan Ngoc Le | Fatiha Sadat
Proceedings of the 2019 Workshop on Widening NLP
Commonsense can be vital in some applications, such as Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. Despite the remarkable success of neural network approaches on a variety of Natural Language Processing tasks, many of them struggle to react effectively in cases that require commonsense knowledge. In the present research, we take advantage of the availability of the open multilingual knowledge graph ConceptNet by using it as an additional external resource in Named Entity Recognition (NER). Our proposed architecture involves BiLSTM layers combined with a CRF layer, augmented with features such as pre-trained word embedding layers and dropout layers. Moreover, in addition to word representations, we also used character-based representations to capture morphological and orthographic information. Our experiments and evaluations showed an improvement in overall performance of +2.86 in F1-measure. Commonsense reasoning has been employed in other studies and NLP tasks, but to the best of our knowledge, no study has addressed the integration of a commonsense knowledge base in NER.