On Breadth Alone: Improving the Precision of Terminology Extraction Systems on Patent Corpora

Sean Nordquist; Adam Meyers

doi:10.18653/v1/2022.nllp-1.1

Abstract

Automatic Terminology Extraction (ATE) methods are a class of linguistic, statistical, machine learning or hybrid techniques for identifying terminology in a set of documents. Most modern ATE methods use a statistical measure of how important or characteristic a potential term is to a foreground corpus by using a second background corpus as a baseline. While many variables in ATE methods have been carefully evaluated and tuned in the literature, the effects of choosing a particular background corpus over another are not obvious. In this paper, we propose a methodology that allows us to adjust the relative breadth of the foreground and background corpora in patent documents by taking advantage of the Cooperative Patent Classification (CPC) scheme. Our results show that for every foreground corpus, the broadest background corpus gave the worst performance; in the worst case, the difference is 17%. Similarly, the least broad background corpus gave suboptimal performance in all three experiments. We also demonstrate qualitative differences between background corpora: narrower background corpora tend towards more technical output. We expect our results to generalize to terminology extraction for other legal and technical documents and, generally, to the foreground/background approach to ATE.
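
To make the foreground/background idea concrete, here is a minimal sketch of one generic way to score candidate terms: a smoothed log ratio of a token's relative frequency in the foreground corpus to its relative frequency in the background corpus. This is an illustrative assumption, not the measure used in the paper; the function name `term_scores`, the whitespace tokenization, and the `smoothing` parameter are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def term_scores(foreground_docs, background_docs, smoothing=1.0):
    """Score each foreground token by a smoothed log relative-frequency ratio.

    Hypothetical illustration of the foreground/background approach:
    a positive score means the token is relatively more frequent in the
    foreground corpus than in the background corpus.
    """
    # Naive tokenization (lowercase + whitespace split); an assumption.
    fg = Counter(tok for doc in foreground_docs for tok in doc.lower().split())
    bg = Counter(tok for doc in background_docs for tok in doc.lower().split())

    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        # Additive smoothing keeps the ratio finite for tokens that are
        # absent from the background corpus.
        p_fg = (count + smoothing) / (fg_total + smoothing * vocab_size)
        p_bg = (bg[term] + smoothing) / (bg_total + smoothing * vocab_size)
        scores[term] = math.log(p_fg / p_bg)
    return scores


if __name__ == "__main__":
    foreground = ["the rotor blade spins", "the rotor hub"]
    background = ["the cat sat", "the dog ran on the mat"]
    ranked = sorted(term_scores(foreground, background).items(),
                    key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        print(f"{term}\t{score:.3f}")
```

Under a scheme like this, swapping in a broader or narrower background corpus changes `p_bg` for every candidate, which is why the choice of background corpus can shift which terms rank highest.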

Anthology ID: 2022.nllp-1.1
Volume: Proceedings of the Natural Legal Language Processing Workshop 2022
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates (Hybrid)
Editors: Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, Daniel Preoțiuc-Pietro
Venue: NLLP
Publisher: Association for Computational Linguistics
Pages: 1–11
URL: https://aclanthology.org/2022.nllp-1.1/
DOI: 10.18653/v1/2022.nllp-1.1
Cite (ACL): Sean Nordquist and Adam Meyers. 2022. On Breadth Alone: Improving the Precision of Terminology Extraction Systems on Patent Corpora. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 1–11, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal): On Breadth Alone: Improving the Precision of Terminology Extraction Systems on Patent Corpora (Nordquist & Meyers, NLLP 2022)
PDF: https://aclanthology.org/2022.nllp-1.1.pdf
