A Dataset and Evaluation Framework for Complex Geographical Description Parsing

Egoitz Laparra, Steven Bethard


Abstract
Much previous work on geoparsing has focused on identifying and resolving individual toponyms in text, such as Adrano, S.Maria di Licodia, or Catania. However, geographical locations occur not only as individual toponyms, but also as compositions of reference geolocations joined and modified by connectives, e.g., “... between the towns of Adrano and S.Maria di Licodia, 32 kilometres northwest of Catania”. Ideally, a geoparser should be able to take such text, together with the geographical shapes of the toponyms referenced within it, and parse these into a geographical shape, formed by a set of coordinates, that represents the location described. But creating a dataset for this complex geoparsing task is difficult and, if done manually, would require a huge amount of effort to annotate the geographical shapes of not only the geolocation described but also the reference toponyms. We present an approach that automates most of the process by combining Wikipedia and OpenStreetMap. As a result, we have gathered a collection of 360,187 uncurated complex geolocation descriptions, from which we have manually curated 1,000 examples intended to be used as a test set. To accompany the data, we define a new geoparsing evaluation framework along with a scoring methodology and a set of baselines.
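To make the task's input and output concrete, the example description in the abstract can be approximated with points instead of full geographical shapes. The following is only a minimal sketch under stated assumptions, not the authors' method or any of their baselines: the helper names (`destination`, `midpoint`, `haversine_km`), the spherical Earth model, and the rough town coordinates are all illustrative and are not taken from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def destination(lat, lon, bearing_deg, distance_km):
    """Point reached from (lat, lon) after distance_km along bearing_deg on a sphere."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance
    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(theta)
    )
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


def midpoint(p, q):
    """Arithmetic mean of two (lat, lon) points; adequate only for nearby points."""
    return (p[0] + q[0]) / 2, (p[1] + q[1]) / 2


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlam = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Rough, illustrative (lat, lon) values for the reference toponyms (assumptions).
ADRANO = (37.66, 14.83)
SM_LICODIA = (37.62, 14.90)
CATANIA = (37.50, 15.09)

# "between the towns of Adrano and S.Maria di Licodia"
between_towns = midpoint(ADRANO, SM_LICODIA)
# "32 kilometres northwest of Catania" (a bearing of 315 degrees)
nw_of_catania = destination(CATANIA[0], CATANIA[1], 315.0, 32.0)

# How far apart the two point estimates of the same location are, in km.
print(between_towns, nw_of_catania, haversine_km(between_towns, nw_of_catania))
```

The task described in the abstract operates on geographical shapes rather than single points, so a realistic system would combine polygons for each reference toponym. The point version only shows how connectives such as "between" and "northwest of" map to geometric operations.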
Anthology ID:
2020.coling-main.81
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
936–948
URL:
https://aclanthology.org/2020.coling-main.81
DOI:
10.18653/v1/2020.coling-main.81
Cite (ACL):
Egoitz Laparra and Steven Bethard. 2020. A Dataset and Evaluation Framework for Complex Geographical Description Parsing. In Proceedings of the 28th International Conference on Computational Linguistics, pages 936–948, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
A Dataset and Evaluation Framework for Complex Geographical Description Parsing (Laparra & Bethard, COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.81.pdf
Code:
egolaparra/geocode-data