EXTRACTING PARALLEL PHRASES FROM COMPARABLE ENGLISH AND PUNJABI CORPORA USING AN INTEGRATED APPROACH

Manpreet Singh Lehal, Vishal Goyal


Abstract
Machine translation from English to Indian languages is difficult because good-quality corpora are scarce and Indian languages are morphologically rich. A system needs a large corpus to produce better translations. We employ three similarity and distance measures and have developed software that automatically extracts parallel data from comparable corpora with high precision using minimal resources. The software is built on four algorithms. The first three compute Cosine Similarity, Euclidean Distance Similarity and Jaccard Similarity. The fourth integrates the outputs of the first three in order to improve the efficiency of the system.
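The three measures named in the abstract, and one possible way of integrating them, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes both phrases have already been tokenized and mapped into a shared vocabulary (for example through a bilingual dictionary), it converts Euclidean distance to a similarity as 1 / (1 + d), and it integrates the scores by simple averaging against a hypothetical threshold. The abstract does not specify any of these choices, and all function names are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Euclidean distance between term-frequency vectors, mapped into (0, 1].

    The 1 / (1 + d) mapping is an assumption made for this sketch.
    """
    va, vb = Counter(a), Counter(b)
    dist = math.sqrt(sum((va[t] - vb[t]) ** 2 for t in va.keys() | vb.keys()))
    return 1.0 / (1.0 + dist)


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return len(sa & sb) / len(union)


def integrated_score(a, b):
    """Hypothetical integration step: the unweighted mean of the three scores."""
    scores = (
        cosine_similarity(a, b),
        euclidean_similarity(a, b),
        jaccard_similarity(a, b),
    )
    return sum(scores) / len(scores)


def is_parallel(a, b, threshold=0.6):
    """Accept a candidate pair when the integrated score reaches the threshold.

    The threshold value is illustrative, not taken from the paper.
    """
    return integrated_score(a, b) >= threshold
```

A candidate pair would be passed as two token lists, for instance `is_parallel(["water", "is", "life"], ["water", "is", "life"])`. Averaging is only one option; a majority vote across per-measure thresholds would be an equally plausible reading of "integrate the outputs".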
Anthology ID:
2020.icon-demos.4
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations
Month:
December
Year:
2020
Address:
Patna, India
Editors:
Vishal Goyal, Asif Ekbal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
10–12
URL:
https://aclanthology.org/2020.icon-demos.4
Cite (ACL):
Manpreet Singh Lehal and Vishal Goyal. 2020. EXTRACTING PARALLEL PHRASES FROM COMPARABLE ENGLISH AND PUNJABI CORPORA USING AN INTEGRATED APPROACH. In Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations, pages 10–12, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
EXTRACTING PARALLEL PHRASES FROM COMPARABLE ENGLISH AND PUNJABI CORPORA USING AN INTEGRATED APPROACH (Lehal & Goyal, ICON 2020)
PDF:
https://aclanthology.org/2020.icon-demos.4.pdf