Debashis Mukherjee


2024

GeoIndia: A Seq2Seq Geocoding Approach for Indian Addresses
Bhavuk Singhal | Anshu Aditya | Lokesh Todwal | Shubham Jain | Debashis Mukherjee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Geocoding, the conversion of unstructured geographic text into structured spatial data, is essential for logistics, urban planning, and location-based services. Indian addresses, with their diverse languages, scripts, and formats, present significant challenges that existing geocoding methods often fail to address, particularly at fine-grained resolutions. In this paper, we propose GeoIndia, a novel geocoding system designed specifically for Indian addresses using hierarchical H3-cell prediction within a Seq2Seq framework. Our methodology includes a comprehensive analysis of Indian addressing systems, leading to the development of a data correction strategy that enhances prediction accuracy. We investigate two model architectures, Flan-T5-base (T5) and Llama-3-8b (QLF-Llama-3), chosen for their strong sequence generation capabilities. We trained around 29 models, one dedicated to each state, and results show that our approach provides superior accuracy and reliability across multiple Indian states, outperforming the renowned geocoding platform Google Maps. In multiple states, we achieved more than a 50% reduction in mean distance error and more than an 85% reduction in 99th percentile distance error compared to Google Maps. This advancement can help optimize logistics in the e-commerce sector, reducing delivery failures and improving customer satisfaction.
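
The two metrics reported in the abstract, mean distance error and 99th percentile distance error, can be illustrated with a minimal sketch. The abstract does not say how distances or percentiles were computed, so this assumes great-circle (haversine) distance in kilometres and a linearly interpolated percentile. The function names `haversine_km`, `percentile`, `error_summary`, and `relative_reduction` are hypothetical and not taken from the paper.

```python
import math

# Mean Earth radius in kilometres (an assumption; the paper does not specify one).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def error_summary(predicted, actual):
    """Mean and 99th percentile distance error for paired (lat, lon) lists."""
    errors = [haversine_km(p[0], p[1], a[0], a[1]) for p, a in zip(predicted, actual)]
    return {"mean_km": sum(errors) / len(errors), "p99_km": percentile(errors, 99)}


def relative_reduction(baseline, candidate):
    """Fractional reduction of an error metric relative to a baseline (0.5 means 50%)."""
    return (baseline - candidate) / baseline
```

Under these assumptions, a "more than 50% reduction in mean distance error" corresponds to `relative_reduction(baseline_mean, model_mean) > 0.5`, where both means come from `error_summary` evaluated on the same set of ground-truth coordinates.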