Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection

Ngoc Tan Le, Fatiha Sadat


Abstract
The creation of high-quality named entity annotated resources is a time-consuming and expensive process. Most gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. For Asian languages, this task remains problematic. This paper focuses on the automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced language pair. We incrementally apply different cross-projection methods using parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on the Vietnamese-French language pair show good accuracy (F-score of 94.90%) when identifying named entity pairs and building a named entity annotated parallel corpus.
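The abstract names two projection steps, perfect string matching and edit distance similarity, without giving details. The following is only a minimal sketch of how such an incremental projection might look, not the authors' implementation. The function names, the diacritic-stripping normalization, the 0.8 similarity threshold, and the 4-token span limit are all assumptions made for illustration.

```python
import unicodedata


def normalize(text):
    """Lowercase and strip diacritics (an assumed preprocessing step)."""
    # 'đ' has no canonical decomposition, so map it by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def candidate_spans(tokens, max_width):
    """Yield (start, end, text) for every span of up to max_width tokens."""
    for start in range(len(tokens)):
        last = min(start + max_width, len(tokens))
        for end in range(start + 1, last + 1):
            yield start, end, " ".join(tokens[start:end])


def project_entity(entity, target_tokens, threshold=0.8, max_width=4):
    """Locate a source-side entity in the target sentence.

    Pass 1 tries perfect string matching; pass 2 falls back to the most
    similar span whose normalized similarity reaches the (assumed) threshold.
    Returns (start, end, method) or None.
    """
    spans = list(candidate_spans(target_tokens, max_width))
    for start, end, text in spans:
        if text == entity:
            return start, end, "exact"
    best, best_score = None, threshold
    norm_entity = normalize(entity)
    for start, end, text in spans:
        score = similarity(norm_entity, normalize(text))
        if score >= best_score:
            best, best_score = (start, end, "fuzzy"), score
    return best
```

As a usage example, a French-side entity such as "Paris" would be found by the exact pass in the tokens of "Tôi sống ở Paris", while "Hanoï" would only be matched to the Vietnamese span "Hà Nội" by the edit distance pass. Running the cheap exact pass first keeps the fuzzy pass for entities whose spelling differs between the two languages.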
Anthology ID:
2015.jeptalnrecital-demonstration.6
Volume:
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Month:
June
Year:
2015
Address:
Caen, France
Editors:
Jean-Marc Lecarpentier, Nadine Lucas
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
12–13
URL:
https://aclanthology.org/2015.jeptalnrecital-demonstration.6
Cite (ACL):
Ngoc Tan Le and Fatiha Sadat. 2015. Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection. In Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations, pages 12–13, Caen, France. ATALA.
Cite (Informal):
Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection (Le & Sadat, JEP/TALN/RECITAL 2015)
PDF:
https://aclanthology.org/2015.jeptalnrecital-demonstration.6.pdf