A Study on Terminology Extraction Based on Classified Corpora

Yirong Chen; Qin Lu; Wenjie Li; Zhifang Sui; Luning Ji

A Study on Terminology Extraction Based on Classified Corpora

Yirong Chen, Qin Lu, Wenjie Li, Zhifang Sui, Luning Ji

Abstract

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely Unithood and Termhood (Kageura, 1996). Unithood refers to the degree of a string to occur as a word or a phrase. Termhood (Chen Yirong, 2005) refers to the degree of a word or a phrase to occur as a domain specific concept. Unlike unithood, study on termhood is not yet widely reported. In classified corpora, the class information provides the cue to the nature of data and can be used in termhood calculation. Three algorithms are provided and evaluated to investigate termhood based on classified corpora. The three algorithms are based on lexicon set computing, term frequency and document frequency, and the strength of the relation between a term and its document class respectively. Our objective is to investigate the effects of these different termhood measurement features. After evaluation, we can find which features are more effective and also, how we can improve these different features to achieve the best performance. Preliminary results show that the first measure can effectively filter out independent terms or terms of general use.

Anthology ID:: L06-1238
Volume:: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:: May
Year:: 2006
Address:: Genoa, Italy
Editors:: Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:: http://www.lrec-conf.org/proceedings/lrec2006/pdf/407_pdf.pdf
DOI:
Bibkey:
Cite (ACL):: Yirong Chen, Qin Lu, Wenjie Li, Zhifang Sui, and Luning Ji. 2006. A Study on Terminology Extraction Based on Classified Corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):: A Study on Terminology Extraction Based on Classified Corpora (Chen et al., LREC 2006)
Copy Citation:
PDF:: http://www.lrec-conf.org/proceedings/lrec2006/pdf/407_pdf.pdf

PDF Cite Search Fix data