Pretraining and Finetuning Language Models on Geospatial Networks for Accurate Address Matching

Saket Maheshwary; Arpan Paul; Saurabh Sohoney

doi:10.18653/v1/2024.emnlp-industry.58

Abstract

We propose a novel framework for pretraining and fine-tuning language models to determine whether two addresses represent the same physical building. Address matching and the construction of authoritative address catalogues are important to many applications and businesses, such as delivery services, online retail, emergency services, and logistics. We view a collection of addresses as an address graph and curate inputs for language models by placing geospatially linked addresses in the same context. Our approach jointly integrates concepts from graph theory and weak supervision with address text and geospatial semantics. This integration enables us to generate informative and diverse address pairs, facilitating pretraining and fine-tuning in a self-supervised manner. Experiments and ablation studies on manually curated datasets, together with comparisons against state-of-the-art techniques, demonstrate the efficacy of our approach. Compared with the current baseline, we achieve a 24.49% improvement in recall on average across multiple geographies while maintaining 95% precision. Further, we deploy our proposed approach and show the positive impact of improved address matching on geocode learning.
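
To make the idea of an address graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes each address record carries a latitude and longitude, links two addresses when they lie within a hypothetical distance threshold (`max_dist_m`), and joins each linked pair into a single text input with an assumed separator string. The function names, the threshold, the separator, and the linking rule are all illustrative assumptions; the abstract does not specify how edges are defined.

```python
import math
from itertools import combinations


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def build_address_graph(records, max_dist_m=25.0):
    """Build an undirected graph over address indices.

    records: list of (address_text, lat, lon) tuples.
    Two addresses are linked when they are at most max_dist_m apart.
    The threshold is an assumption chosen for illustration only.
    """
    graph = {i: set() for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        _, lat1, lon1 = records[i]
        _, lat2, lon2 = records[j]
        if haversine_m(lat1, lon1, lat2, lon2) <= max_dist_m:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def linked_pairs(records, graph, sep=" [SEP] "):
    """Place each pair of linked addresses in one text context.

    The separator string is a placeholder, not taken from the paper.
    """
    return [
        records[i][0] + sep + records[j][0]
        for i in sorted(graph)
        for j in sorted(graph[i])
        if i < j  # emit each undirected edge once
    ]
```

Calling `build_address_graph` on a list of records and passing the result to `linked_pairs` yields one string per edge, which could then serve as a weakly supervised candidate pair. The all-pairs loop is quadratic and only suitable for small collections; a spatial index would be needed at scale.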

Anthology ID: 2024.emnlp-industry.58
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 763–773
URL: https://aclanthology.org/2024.emnlp-industry.58/
DOI: 10.18653/v1/2024.emnlp-industry.58
Cite (ACL): Saket Maheshwary, Arpan Paul, and Saurabh Sohoney. 2024. Pretraining and Finetuning Language Models on Geospatial Networks for Accurate Address Matching. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 763–773, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): Pretraining and Finetuning Language Models on Geospatial Networks for Accurate Address Matching (Maheshwary et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-industry.58.pdf
