LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

Abhishek Arora, Melissa Dell


Abstract
Many computational analyses require linking information across noisy text datasets. While large language models (LLMs) offer significant promise, approximate string matching packages in popular statistical software such as R and Stata remain predominant in academic applications. These packages have simple interfaces and can be easily extended to a diversity of languages and settings, and for academic applications, ease of use and extensibility are essential. In contrast, packages for record linkage with LLMs require significant familiarity with deep learning frameworks and often focus on specialized applications of commercial value in English. The open-source package LinkTransformer aims to bridge this gap by providing end-to-end software for performing record linkage and other data cleaning tasks with transformer LLMs, treating linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage. LinkTransformer contains a rich repository of pre-trained models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI, providing the extensibility required for many scholarly applications. Its APIs also perform common data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. LinkTransformer contains comprehensive tools for efficient model tuning, allowing for highly customized applications, and users can easily contribute their custom-trained models to its model hub to ensure reproducibility. Using a novel benchmark dataset geared towards academic applications, we show that LinkTransformer, with both custom models and Hugging Face or OpenAI models off the shelf, outperforms string matching by a wide margin. By combining transformer LMs with intuitive APIs, LinkTransformer aims to democratize these performance gains for those who lack familiarity with deep learning frameworks.
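
The abstract's "linkage as text retrieval" framing is exposed through a pandas-style merge API. Below is a minimal usage sketch: lt.merge and its parameters follow the interface described in the paper and package documentation, but the toy DataFrames, column name, and model choice are illustrative assumptions, not from this page.

# Minimal sketch of record linkage with LinkTransformer.
# The DataFrames and "CompanyName" column are hypothetical examples.
import pandas as pd
import linktransformer as lt

df1 = pd.DataFrame({"CompanyName": ["Apple Inc.", "Alphabet Inc."]})
df2 = pd.DataFrame({"CompanyName": ["apple incorporated", "google llc"]})

# Embeds the key columns with the chosen transformer model and links each
# row of df1 to its nearest neighbor in df2 (1:m = one df2 row may match many).
df_matched = lt.merge(
    df1,
    df2,
    merge_type="1:m",
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(df_matched.head())

The same pattern extends to the other tasks the abstract names (e.g., noisy de-duplication via the package's dedup utilities, or cross-lingual linkage by swapping in a multilingual embedding model).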
Anthology ID:
2024.acl-demos.21
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Yixin Cao, Yang Feng, Deyi Xiong
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
221–231
URL:
https://aclanthology.org/2024.acl-demos.21
DOI:
10.18653/v1/2024.acl-demos.21
Cite (ACL):
Abhishek Arora and Melissa Dell. 2024. LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 221–231, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models (Arora & Dell, ACL 2024)
PDF:
https://aclanthology.org/2024.acl-demos.21.pdf