JRC Eurovoc Indexer JEX - A freely available multi-label categorisation tool

Ralf Steinberger, Mohamed Ebrahim, Marco Turchi


Abstract
EuroVoc (2012) is a highly multilingual thesaurus consisting of over 6,700 hierarchically organised subject domains used by European Institutions and many authorities in Member States of the European Union (EU) for the classification and retrieval of official documents. JEX is JRC-developed multi-label classification software that learns from manually labelled data to automatically assign EuroVoc descriptors to new documents in a profile-based category-ranking task. The JEX release consists of trained classifiers for 22 official EU languages, parallel training data in the same languages, an interface for viewing and amending the assignment results, and a module that allows users to re-train the tool on their own document collections. Advanced users can change the document representation, for example through linguistic pre-processing, which may improve the categorisation result. JEX can be used as a tool for interactive EuroVoc descriptor assignment to increase the speed and consistency of the human categorisation process, or it can be used fully automatically. The output of JEX is a language-independent EuroVoc feature vector that can also serve as input to various other Language Technology tasks, including cross-lingual clustering and classification, cross-lingual plagiarism detection, and sentence selection and ranking.
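The idea of profile-based category ranking mentioned in the abstract can be illustrated with a minimal Python sketch. It is not JEX's implementation: the abstract does not give the weighting scheme or similarity measure, so raw term counts and cosine similarity are assumed here, and the function names, training texts, and descriptor labels are all made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(labelled_docs):
    """Build one term-count profile per descriptor from labelled documents.

    Assumption: a profile is the summed term counts of every training
    document carrying that descriptor (a stand-in for a real weighting).
    """
    profiles = {}
    for tokens, labels in labelled_docs:
        counts = Counter(tokens)
        for label in labels:
            profiles.setdefault(label, Counter()).update(counts)
    return profiles


def rank_descriptors(tokens, profiles, top_n=3):
    """Rank all descriptors for a new document by similarity to its profile."""
    doc = Counter(tokens)
    scored = [(label, cosine(doc, prof)) for label, prof in profiles.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy multi-label training data (illustrative labels, not taken from EuroVoc).
TRAINING = [
    ("fishing quota fleet sea".split(), ["fishery policy"]),
    ("fishing vessel sea pollution".split(), ["fishery policy", "marine pollution"]),
    ("oil spill sea pollution".split(), ["marine pollution"]),
]
PROFILES = build_profiles(TRAINING)
RANKED = rank_descriptors("fleet fishing quota".split(), PROFILES)

if __name__ == "__main__":
    for label, score in RANKED:
        print(f"{label}\t{score:.3f}")
```

Because every descriptor receives a score, the ranked list can either be cut off at a fixed length for fully automatic use or shown to a human indexer as suggestions, which matches the two usage modes the abstract describes.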
Anthology ID:
L12-1519
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
798–805
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/875_Paper.pdf
Cite (ACL):
Ralf Steinberger, Mohamed Ebrahim, and Marco Turchi. 2012. JRC Eurovoc Indexer JEX - A freely available multi-label categorisation tool. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 798–805, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
JRC Eurovoc Indexer JEX - A freely available multi-label categorisation tool (Steinberger et al., LREC 2012)