Saurabh Sohoney


2024

Pretraining and Finetuning Language Models on Geospatial Networks for Accurate Address Matching
Saket Maheshwary | Arpan Paul | Saurabh Sohoney
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

We propose a novel framework for pretraining and fine-tuning language models with the goal of determining whether two addresses represent the same physical building. Address matching and building authoritative address catalogues are important to many applications and businesses, such as delivery services, online retail, emergency services, and logistics. We propose to view a collection of addresses as an address graph and curate inputs for language models by placing geospatially linked addresses in the same context. Our approach jointly integrates concepts from graph theory and weak supervision with address text and geospatial semantics. This integration enables us to generate informative and diverse address pairs, facilitating pretraining and fine-tuning in a self-supervised manner. Experiments and ablation studies on manually curated datasets and comparisons with state-of-the-art techniques demonstrate the efficacy of our approach. We achieve a 24.49% improvement in recall while maintaining 95% precision on average, in comparison to the current baseline across multiple geographies. Further, we deploy our proposed approach and show the positive impact of improving address matching on geocode learning.

2022

Learning Geolocations for Cold-Start and Hard-to-Resolve Addresses via Deep Metric Learning
Govind | Saurabh Sohoney
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

With ever-growing digital adoption in society and increasing demand for businesses to deliver to customers' doorsteps, the last-mile hop of transportation planning poses unique challenges in emerging geographies with unstructured addresses. One of the crucial inputs to effective planning is the geolocation of customer addresses. Existing systems operate by aggregating historical delivery locations or by resolving/matching addresses to known buildings and campuses to vend a high-precision geolocation. However, by design they fail to cater to a significant fraction of addresses that are new to the system and have inaccurate or missing building-level information. We propose a framework to resolve these addresses (referred to as hard-to-resolve henceforth) to a shallower granularity termed neighbourhood. Specifically, we propose a weakly supervised deep metric learning model to encode the geospatial semantics in address embeddings. We present an empirical evaluation on hard-to-resolve addresses from India (IN) and the United Arab Emirates (UAE) that shows significant improvements in learning geolocations, i.e., a 22% (IN) and 55% (UAE) reduction in delivery defects (where learnt geocode is Y meters away from actual location), and a 43% (IN) and 90% (UAE) reduction in the 50th percentile (p50) distance between learnt and actual delivery locations over the existing production system.

2010

All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision
Mitesh Khapra | Anup Kulkarni | Saurabh Sohoney | Pushpak Bhattacharyya
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

CFILT: Resource Conscious Approaches for All-Words Domain Specific WSD
Anup Kulkarni | Mitesh Khapra | Saurabh Sohoney | Pushpak Bhattacharyya
Proceedings of the 5th International Workshop on Semantic Evaluation

Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD
Mitesh Khapra | Saurabh Sohoney | Anup Kulkarni | Pushpak Bhattacharyya
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)