Multi-Level Gazetteer-Free Geocoding

We present a multi-level geocoding model (MLG) that learns to associate texts with geographic coordinates. The Earth's surface is represented using space-filling curves that decompose the sphere into a hierarchical grid. MLG balances classification granularity and accuracy by combining losses across multiple levels and jointly predicting cells at those levels. It obtains large gains without any gazetteer metadata, demonstrating that it can effectively learn the connection between text spans and coordinates, which makes it a gazetteer-free geocoder. Furthermore, MLG obtains state-of-the-art results for toponym resolution on three English datasets without any dataset-specific tuning.


Introduction
Geocoding is the task of resolving location references in text to geographic coordinates or regions. It is often studied in social networks, where metadata and the network itself provide additional non-textual signals (Backstrom et al., 2010; Rahimi et al., 2015). If locations can be mapped to an entity in a knowledge graph, toponym resolution (a special case of entity resolution) can be used to resolve references to locations. Past work used heuristics based on location popularity (Leidner, 2007) and distance between candidate locations (Speriosu and Baldridge, 2013), as well as learned associations from text to locations. However, such approaches have a strong bias toward highly populated locations, especially for social media.
We present Multi-Level Geocoder (MLG, Fig. 1), a model that learns spatial language representations and maps toponyms to coordinates on Earth's surface. This geocoder is not restricted to resolving toponyms to specific location entities; it resolves them to geo-coordinates directly. MLG can thus be extended to arbitrary location references in the future without relying on their presence in a gazetteer. For comparative evaluation, we use three English toponym resolution datasets from distinct textual domains. MLG shows strong performance, even without gazetteer and population metadata.
* Equal contribution. † Work done during internship at Google.
Figure 1: Overview of Multi-Level Geocoder, using multiple context features and jointly predicting cells at multiple levels of the S2 hierarchy. Figure text: "The five boroughs - Brooklyn, Queens, Manhattan, the Bronx and Staten Island - were consolidated into a single city in 1898."; "Joint loss".
MLG is a text-to-location neural geocoder. We represent locations using S2 geometry, a hierarchical discretization of the Earth's surface based on space-filling curves. S2 naturally supports spatial representation at multiple levels, including very fine-grained cells (as small as 1 cm² at level 30). Here, we use combinations of levels 4 (∼300K km²) to 8 (∼1K km²). Large cells are easy to predict accurately; however, they are too coarse on their own and perform poorly on metrics that consider error distances. Smaller cells improve granularity but result in larger and harder output spaces with less training evidence per cell. MLG balances classification granularity and accuracy by predicting at multiple S2 levels and jointly optimizing for the loss at each level. Fig. 1 shows an area around New York City covered by cell id 0x89c25 at level 8 and 0x89c4 at level 5. This is more fine-grained than previous work on text-to-location geocoding (Gritta et al., 2018a), which uses arbitrary square-degree cells, e.g. 2°-by-2° cells (∼48K km²).
Unlike previous work that relies on external gazetteer information, MLG is more flexible and can predict geolocation from context alone. For instance, it predicts the location of Manhattan from the surrounding words (The five boroughs - Brooklyn, Queens, the Bronx and Staten Island - ...). Earlier approaches instead relied on a knowledge graph that had Manhattan as an entity. While the hierarchical geolocation model of Wing and Baldridge (2014) over k-d trees has some more fine-grained cells, MLG predicts over a much larger set of smaller cells. Furthermore, MLG is a single model that jointly incorporates multiple levels rather than ensembling independent per-cell models for each level.
Our main contributions are the following.
• We define MLG, a model that jointly predicts cells at multiple levels, including finer-grained cells than previous work.
• We show that S2 provides a strong and standardized hierarchical discretization of the Earth's surface for cell-based geocoders.
• We show that it is possible and even preferable to eschew gazetteer metadata. In particular, our experiments show that this strategy generalizes much better.
• We show state-of-the-art performance on three English datasets without any fine-tuning.
• When analyzing these datasets, we found inconsistencies in the true coordinates, which we unify to support consistent evaluation.

Spatial representations
Geocoders map text spans to geo-coordinates, a prediction over a continuous space representing the surface of a sphere. We relax the problem from continuous space to discrete space by quantizing the Earth's surface as a grid and performing multi-class prediction over the grid's cells. We construct a hierarchical grid using the S2 library. S2 projects the six faces of a cube onto the Earth's surface, and each face is recursively divided into 4 quadrants, as shown in Figure 1. Cells at each level are indexed using a Hilbert curve. Each S2 cell is represented as a 64-bit unsigned integer and can correspond to areas as small as ≈1 cm². S2 cells preserve cell size across the globe better than commonly used degree-square grids (e.g. 1°-by-1°; Serdyukov et al., 2009; Wing and Baldridge, 2011). Hierarchical triangular meshes (Szalay et al., 2007) and Hierarchical Equal Area iso-Latitude Pixelation (Melo and Martins, 2015) are alternatives that preserve cell size better, but S2 is easier to work with and has strong, standard tooling.
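As a rough check on the cell sizes quoted in this paper, the following sketch computes the number of cells and the mean cell area per level. It assumes only the subdivision rule described above (six faces, four children per cell) and an approximate Earth surface area of 510 million km²; the function names are illustrative and are not part of the S2 library.

```python
EARTH_SURFACE_KM2 = 510_072_000  # approximate total surface area of the Earth (assumption)


def s2_cell_count(level: int) -> int:
    """Number of cells at a level: 6 cube faces, each split into 4 per level."""
    if not 0 <= level <= 30:
        raise ValueError("S2 levels range from 0 to 30")
    return 6 * 4 ** level


def mean_cell_area_km2(level: int) -> float:
    """Average cell area, ignoring the modest size variation between cells."""
    return EARTH_SURFACE_KM2 / s2_cell_count(level)


for level in (4, 5, 6, 7, 8):
    print(level, s2_cell_count(level), round(mean_cell_area_km2(level)))
```

Under these assumptions, levels 5, 6 and 7 give 6,144, 24,576 and 98,304 cells, matching the 6K, 24K and 98K output classes used by MLG.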
Our experiments go as far as S2 level eight (of thirty), but our approach is extendable to any level of granularity and could support very fine-grained locations like buildings and landmarks. The built-in hierarchical nature of S2 cells makes it well suited as a scaffold for models that learn and combine evidence from multiple levels. This combines the best of both worlds: specificity at finer levels and aggregation/smoothing at coarser levels. Roller et al. (2012) (Kamalloo and Rafiei, 2018). Polygons for geopolitical entities such as city, state, and country are perhaps ideal, but these too require detailed metadata for all toponyms, managing non-uniformity of the polygons, and general facility with GIS tools. The Point-to-City (P2C) method applies an iterative k-d tree-based method for clustering coordinates and associating them with cities (Fornaciari and Hovy, 2019b). S2 can represent such hierarchies at various levels without relying on external metadata.
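The hierarchy can be navigated with plain integer arithmetic. The sketch below assumes the standard S2 cell-id layout (3 face bits, then 2 bits per level, then a single sentinel bit, zero-padded to 64 bits) and uses illustrative helper names rather than the S2 library's API. With it, the level-8 cell 0x89c25 around New York City maps to the level-5 ancestor 0x89c4 mentioned in the introduction.

```python
MAX_LEVEL = 30


def cell_id_from_token(token: str) -> int:
    """Expand a hex token (trailing zeros stripped) to the 64-bit cell id."""
    return int(token.ljust(16, "0"), 16)


def token_from_cell_id(cell_id: int) -> str:
    """Compact hex form of a cell id, with trailing zeros removed."""
    return format(cell_id, "016x").rstrip("0")


def level_of(cell_id: int) -> int:
    """The lowest set bit is a sentinel; each level uses two bits above it."""
    lowest_set_bit = cell_id & -cell_id
    trailing_zeros = lowest_set_bit.bit_length() - 1
    return MAX_LEVEL - trailing_zeros // 2


def parent_at(cell_id: int, level: int) -> int:
    """Ancestor at a coarser level: drop the finer bits, re-place the sentinel."""
    if level > level_of(cell_id):
        raise ValueError("requested level is finer than the cell itself")
    new_sentinel = 1 << (2 * (MAX_LEVEL - level))
    return (cell_id & -new_sentinel) | new_sentinel


nyc_level8 = cell_id_from_token("89c25")
print(level_of(nyc_level8), token_from_cell_id(parent_at(nyc_level8, 5)))
```

Because a parent id is a pure function of the child id, a multi-level model needs no lookup table to relate its output classes across levels.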
In accordance with the continuous nature of the problem, studies using bivariate Gaussians on multiple flattened regions (Eisenstein et al., 2010; Priedhorsky et al., 2014) perform well on distance-based metrics, but this involves difficult trade-offs between flattened region sizes and the level of distortion they introduce. Some of the early models used with grid-based representations were probabilistic language models that produce document likelihoods in different geospatial cells (Serdyukov et al., 2009; Wing and Baldridge, 2011; Dias et al., 2012; Roller et al., 2012). Extensions include domain-adapting language models from various sources (Laere et al., 2014), hierarchical discriminative models (Wing and Baldridge, 2014; Melo and Martins, 2015), and smoothing sparse grids with Gaussian priors (Hulden et al., 2015). Alternatively, Fornaciari and Hovy (2019a) use a multi-task learning setup that assigns probabilities across grids and also predicts the true location through regression. Melo and Martins (2017) provide a broad survey of document geocoding. Much of this work has been conducted on social media data like Twitter, where additional information beyond the text, such as network connections and user and document metadata, has been used (Backstrom et al., 2010; Cheng et al., 2010; Han et al., 2014; Rahimi et al., 2015, 2016, 2017). MLG is not trained on social media data and hence does not need additional network information. Further, the data does not have a character limit like tweets, so models can learn from long text sequences.

[Figure 2: panel label "Inference"; example input: "Finland is a Nordic country in Northern Europe bordering the Baltic Sea, Gulf of Bothnia, and Gulf of Finland, between Sweden to the west, Russia to the east, Estonia to the south, and north-eastern Norway to the north."]

Multi-Level Geocoder (MLG)
Multi-Level Geocoder (MLG, Figure 2) is a text-to-location CNN-based geocoder. Context features are similar to CamCoder (Gritta et al., 2018a), but we exclude its metadata-based MapVec feature. Locations are represented using a hierarchical S2 grid; this enables joint multi-level prediction by optimizing the total loss computed from all levels.

Prior geocoding models
Toponym resolution identifies place mentions in text and predicts the precise geo-entity in a knowledge base (Leidner, 2007; Gritta et al., 2018b). The knowledge base is then used to obtain the geo-coordinates of the predicted entity for the geocoding task. Rule-based toponym resolvers (Smith and Crane, 2001; Karimzadeh et al., 2013) rely on hand-built heuristics, such as population, drawn from metadata resources like Wikipedia and the GeoNames gazetteer (www.geonames.org). This works well for many common places, but it is brittle and cannot handle unknown or uncommon place names. As such, machine-learned approaches that use toponym context features have demonstrated better performance (Speriosu and Baldridge, 2013; Zhang and Gelernter, 2014; DeLozier et al., 2015; Santos et al., 2015). A straightforward but data-hungry approach learns a collection of multi-class classifiers, one per toponym, with a gazetteer's locations for the toponym as the classes (e.g., the WISTR model of Speriosu and Baldridge (2013)). A hybrid approach that combines learning and heuristics, predicting a distribution over the grid cells and then filtering the scores through a gazetteer, works for systems like TRIPDL (Speriosu and Baldridge, 2013) and TopoCluster (DeLozier et al., 2015). A combination of classification and regression loss to predict over recursively partitioned regions shows promising results with in-domain training (Cardoso et al., 2019). CamCoder (Gritta et al., 2018a) uses this strategy with a much stronger neural model and achieves state-of-the-art results. It incorporates side metadata in the form of its MapVec feature vector, which encodes knowledge of potential locations and their populations matching all toponyms in the text. It thus uses population signals both in the MapVec feature during training and in output predictions, biasing the predictions toward locations with larger populations.

Building blocks
MLG uses a convolutional neural network to map input text to S2 cells at a given granularity.
Input. MLG extracts three features from the input context: (a) the token sequence $w_{a,1:l_a}$, all the tokens in the input; (b) the toponym mentions $w_{b,1:l_b}$, the list of all location words in the context; and (c) the surface form $w_{c,1:l_c}$ of the target toponym that is to be geolocated. All text inputs are transformed uniformly, using shared model parameters. Let the input text be denoted as a word sequence $w_{x,1:l} = [w_{x,1}, \ldots, w_{x,l}]$, initialized using GloVe embeddings $\phi(w_{x,1:l}) = [\phi(w_{x,1}), \ldots, \phi(w_{x,l})]$ (Pennington et al., 2014).
Consider a short context for Manhattan: "Manhattan is the smallest and most densely populated borough compared to others - Bronx, Brooklyn, Queens, and Staten Island." All tokens are lower-cased, so $w_a$ is ["is", "the", "smallest", "and", ...], the toponym mentions $w_b$ are ["bronx", ... , "staten", "island"], and the surface form of the target toponym $w_c$ is "manhattan".
These projections are concatenated to form the full input representation. MLG is designed to study the effectiveness of spatial language representation without any gazetteer information. Hence we choose a CNN-based architecture, but it can be extended to large-scale pretrained language models (Devlin et al., 2018).
Output. An S2 cell is predicted at the highest granularity using a softmax over the output space. The center of the predicted S2 cell is taken as the predicted coordinates. Optionally, the predicted cells may be snapped to the closest valid cells that overlap the potential gazetteer locations for the toponym, weighted by their population (similar to previous work, like CamCoder).

Multi-level classification
MLG's core block is a multi-class classifier using a CNN. Rather than predicting cells at a single level, we project the output onto multiple levels with a multiheaded model. The penultimate layer maps representations of the input to probabilities over the finest-grained cells. Gradient updates are computed using cross entropy loss between predicted probabilities p and the one-hot true class vector c.
MLG exploits the natural hierarchy of geographic locations by jointly predicting at different levels of granularity. CamCoder uses 7.8K output classes representing 2°-by-2° tiles (after filtering cells that have no support in training, such as over bodies of water, to limit the class space). This requires maintaining a cumbersome mapping between actual grid cells and the classes. MLG's multi-level hierarchical representation overcomes this problem by including coarser levels (like L5) to guide the predictions at finer-grained levels. We focus on three levels that are appropriate for the task: L5, L6 and L7 (shown in Table 1), giving 6K, 24K, and 98K output classes, respectively. We define losses at each level (L5, L6, L7) and minimize them jointly, i.e., $L_{\mathrm{total}} = (L(p_{L5}, c_{L5}) + L(p_{L6}, c_{L6}) + L(p_{L7}, c_{L7}))/3$. At inference time, a single forward pass computes probabilities at all three levels. The final score for each L7 cell depends on its predicted probability as well as the probabilities of its ancestor cells at L6 and L5. For an L7 cell $f$ with parent L6 cell $e$ and ancestor L5 cell $d$, the final score is $s_{L7}(f) = p_{L7}(f) \cdot p_{L6}(e) \cdot p_{L5}(d)$, and the final prediction is $\hat{y} = \arg\max_{y} s_{L7}(y)$. This approach is easily extensible to capture additional levels of resolution: we also present results with finer resolution at L8 (∼1K km² area) and coarser resolution at L4 (∼300K km² area) for comparison.
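The joint loss and the hierarchical scoring rule above can be illustrated on a toy hierarchy. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the probabilities are made up, each cell has two children instead of four for brevity, and the parent tables and function names are hypothetical.

```python
import math


def cross_entropy(probs, true_index):
    """Cross entropy against a one-hot target reduces to -log p[true class]."""
    return -math.log(probs[true_index])


def joint_loss(probs_by_level, true_by_level):
    """Average of the per-level cross-entropy losses (L5, L6, L7 in the text)."""
    losses = [cross_entropy(probs_by_level[lv], true_by_level[lv])
              for lv in probs_by_level]
    return sum(losses) / len(losses)


def predict_finest(p7, p6, p5, parent_7_to_6, parent_6_to_5):
    """Score each L7 cell f as p7(f) * p6(parent) * p5(grandparent)."""
    scores = []
    for f, prob in enumerate(p7):
        e = parent_7_to_6[f]   # L6 parent of f
        d = parent_6_to_5[e]   # L5 ancestor of f
        scores.append(prob * p6[e] * p5[d])
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy hierarchy (made-up numbers): 2 L5 cells, 4 L6 cells, 8 L7 cells.
parent_6_to_5 = [0, 0, 1, 1]
parent_7_to_6 = [0, 0, 1, 1, 2, 2, 3, 3]
p5 = [0.7, 0.3]
p6 = [0.1, 0.6, 0.2, 0.1]
p7 = [0.05, 0.05, 0.25, 0.15, 0.30, 0.10, 0.05, 0.05]

best, scores = predict_finest(p7, p6, p5, parent_7_to_6, parent_6_to_5)
loss = joint_loss({5: p5, 6: p6, 7: p7}, {5: 0, 6: 1, 7: 2})
print(best, round(loss, 4))
```

In this toy case the L7 head alone would pick cell 4, but the coarser levels place most of their mass on its neighbors' ancestors, so the combined score selects cell 2. This is the sense in which coarser levels guide the finer-grained prediction.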

Gazetteer-constrained prediction
The only way MLG uses geographic information is from training labels for toponym targets. At test time, MLG predicts a distribution over all cells at each S2 level given the input features and picks the highest-probability cell at the most granular level. We use the center of the cell as the predicted coordinates. However, when the goal is to resolve a specific toponym, an effective heuristic is to use a gazetteer to filter the output predictions to only those that are valid for the toponym. Furthermore, gazetteers come with population information that can be used to nudge predictions toward locations with high populations, which tend to be discussed more than less populous alternatives. Like DeLozier et al. (2015), we consider both gazetteer-free and gazetteer-constrained predictions.
Gazetteer-constrained prediction makes toponym resolution a sub-problem of entity resolution. As with broader entity resolution, a strong baseline is an alias table (the gazetteer) with a popularity prior. For geographic data, the population of each location is an effective quantity for characterizing popularity: choosing Paris, France rather than Paris, Texas for the toponym Paris is a better bet. This is especially true for zero-shot evaluation where one has no in-domain training data.
We follow the strategy of Gritta et al. (2018a) for gazetteer-constrained predictions. We construct an alias table that maps each mention $m$ to a set of candidate locations $C(m)$ using link information from Wikipedia; the population $\mathrm{pop}(\ell)$ of each location $\ell$ is read from WikiData. For each of the gazetteer's candidate locations, we compute a population-discounted distance from the geocoder's predicted location $p$ and choose the candidate with the smallest value: $\arg\min_{\ell \in C(m)} \mathrm{dist}(p, \ell) \cdot (1 - c \cdot \mathrm{pop}(\ell) / \mathrm{pop}(m))$.
Here, $\mathrm{pop}(m)$ is the maximum population among all candidates for mention $m$, $\mathrm{dist}(p, \ell)$ is the great-circle distance between prediction $p$ and location $\ell$, and $c$ is a constant in $[0, 1]$ that indicates the degree of population bias applied. For $c = 0$, the location nearest the prediction is chosen (ignoring population); for $c = 1$, the most populous location is chosen (ignoring $p$). We set $c = 0.9$, which worked best on the development set.
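The population-discounted selection can be sketched as follows. This is a minimal illustration rather than the authors' code: the haversine formula stands in for the great-circle distance, and the candidate coordinates, populations, and predicted point are rough illustrative values, not data from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for the haversine formula)


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def resolve(prediction, candidates, c=0.9):
    """Pick the candidate minimizing dist(p, l) * (1 - c * pop(l) / pop(m))."""
    # pop(m): the largest population among the candidates (guarded against zero).
    max_pop = max(max(pop for _, _, pop in candidates), 1)

    def score(candidate):
        _, coords, pop = candidate
        return great_circle_km(prediction, coords) * (1 - c * pop / max_pop)

    return min(candidates, key=score)[0]


# Illustrative candidates for the mention "Paris": (name, (lat, lon), population).
candidates = [
    ("Paris, France", (48.86, 2.35), 2_100_000),
    ("Paris, Texas", (33.66, -95.56), 25_000),
]
prediction = (33.0, -96.0)  # hypothetical predicted cell center in north-east Texas

print(resolve(prediction, candidates, c=0.9))
print(resolve(prediction, candidates, c=1.0))
```

With c = 0.9 a confident, nearby prediction still overrides the population prior and selects Paris, Texas, whereas c = 1 ignores the prediction and always returns the most populous candidate.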

Training Data and Representation
MLG is trained on geographically annotated Wikipedia pages, excluding all pages in WikToR (see Sec. 4.1). For each page with geo-coordinates, we consider context windows of up to 400 tokens (respecting sentence boundaries) as training example candidates. Only context windows that contain the target Wikipedia toponym are used. We use Google Cloud Natural Language API libraries to tokenize the page text and to identify toponyms in the contexts. We use the July 2019 English Wikipedia dump, which has 1.11M location-annotated pages, giving 1.76M training examples. This is split 90/10 for training/development.
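One way to build such context windows is sketched below. The paper does not specify the exact procedure, so the greedy, non-overlapping packing of whole sentences, the truncation of a single over-long sentence, and the function name are all assumptions; sentences are taken to be pre-tokenized lists of strings.

```python
def context_windows(sentences, target, max_tokens=400):
    """Pack whole sentences into windows of at most max_tokens tokens,
    then keep only the windows that mention the target toponym."""
    windows, current = [], []
    for sentence in sentences:
        # Start a new window when the next sentence would not fit.
        if current and len(current) + len(sentence) > max_tokens:
            windows.append(current)
            current = []
        # A single sentence longer than the limit is truncated (assumption).
        current = current + sentence[:max_tokens]
    if current:
        windows.append(current)

    target_tokens = target.lower().split()
    n = len(target_tokens)

    def mentions_target(window):
        lowered = [token.lower() for token in window]
        return any(lowered[i:i + n] == target_tokens
                   for i in range(len(lowered) - n + 1))

    return [window for window in windows if mentions_target(window)]


page = [
    ["Manhattan", "is", "small", "."],
    ["It", "is", "dense", "."],
    ["Queens", "is", "large", "."],
]
examples = context_windows(page, "Manhattan", max_tokens=8)
print(len(examples), len(examples[0]))
```

Here the first two sentences fill one window and the third starts a new one; only the first window mentions the target, so it is the only training candidate kept.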

Evaluation
We train MLG as a general-purpose geocoder and evaluate it on toponym resolution. A strong baseline is to choose the most populous candidate location (POPBASELINE), i.e., $\arg\max_{\ell \in C(m)} \mathrm{pop}(\ell)$.
LGL consists of 588 news articles from 78 different news sources. This dataset contains 5,088 toponyms, and 41% of these refer to locations with small populations. About 16% of the toponyms are street names, which do not have coordinates and are hence dropped from our evaluation set. About 2% have an entity that does not exist in Wikipedia; these were also dropped, leaving 4,172 examples for evaluation.
GeoVirus (GV) is based on 229 WikiNews articles about global epidemics, obtained using keywords such as "Bird Flu" and "Ebola". Place mentions are manually tagged and assigned Wikipedia page URLs. In total, this dataset provides 2,167 toponyms for evaluation.
WikToR serves as in-domain Wikipedia-based evaluation data, while both LGL and GeoVirus provide out-of-domain news corpora evaluation.

Unified evaluation sets
We use the publicly available versions of the three datasets used in CamCoder. However, after analyzing examples across all of them, we identified inconsistencies in location target coordinates.
First, WikToR's evaluation set delivers annotations based on GeoNames DB and Wikipedia APIs. We discovered that WikToR was annotated with an older version of GeoNames DB, which has a known issue of sign flip in either latitude or longitude of some locations. For example, Santa Cruz, New Mexico was incorrectly tagged as (35, 106) instead of (35, -106). This affects 296 out of 5,000 locations in WikToR-mostly cities in the United States and a few in Australia.
Second, the target coordinates are inconsistent across the three datasets. For example, Canada is (60.0, -95.0) in GeoVirus, (60.0, -96.0) in LGL and (45.4,.7) in WikToR. Given our point-based representations, we need consistent coordinates across the evaluation sets. So we re-annotated all three datasets to unify the coordinates for target toponyms. This was done using Wikidata, to be consistent with the Wikipedia training labels.
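The sign-flip issue described above lends itself to a simple diagnostic: an annotation is suspicious if negating its latitude, its longitude, or both brings it close to a trusted reference coordinate. The sketch below illustrates that check; it is not the re-annotation procedure used here (which draws coordinates from Wikidata), and the tolerance and function names are hypothetical.

```python
import math
from itertools import product


def great_circle_km(a, b, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


def repair_sign_flip(annotated, reference, tol_km=50.0):
    """Return the sign variant of `annotated` closest to `reference`
    if it lies within tol_km of it, otherwise None."""
    lat, lon = annotated
    variants = [(s_lat * lat, s_lon * lon)
                for s_lat, s_lon in product((1, -1), repeat=2)]
    best = min(variants, key=lambda v: great_circle_km(v, reference))
    return best if great_circle_km(best, reference) <= tol_km else None


# Santa Cruz, New Mexico, using the rounded coordinates quoted in the text.
print(repair_sign_flip((35.0, 106.0), (35.0, -106.0)))
```

A return value that differs from the annotation signals a sign flip, while `None` means the mismatch has some other cause and needs manual review.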

Evaluation Metrics
We use three metrics for evaluation: AUC of the error curve, accuracy@161km, and mean distance error. AUC is the area under the discrete curve of sorted log-error distances. It captures the entire distribution of errors and is not sensitive to outliers. Because it uses the log of the error distances, it appropriately focuses the metric on smaller error distances. Accuracy is the percentage of toponyms that are resolved to within 161 km (100 miles) of their true location. Mean distance error is the average of all distances between predicted locations (center of the predicted S2 cell) and true locations of the target toponym.
Table 4: Models trained with different granularities help trade off accuracy and generalization. The selected model, MLG 5-7, is chosen based on optimal performance on the holdout set.
                  37  55  54  49  91  51  51  64  197  1529  1570  1099
MLG 5-7    7.25   37  54  55  49  91  53  49  64  180  1407  1690  1092
MLG 5-8   13.28   38  58  67  54  89  45  24  53  272  1866  3058  1732
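The three metrics defined in this section can be sketched in a few lines. The AUC normalization below (adding 1 km before taking the log, and dividing by the log of roughly half the Earth's circumference so that values fall between 0 and 1) follows a common convention in prior toponym resolution work and is an assumption here, as are the function names.

```python
import math

MAX_ERROR_KM = 20039  # about half the Earth's circumference (assumed normalizer)


def accuracy_at(errors_km, threshold_km=161.0):
    """Fraction of predictions within threshold_km of the true location."""
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)


def mean_error(errors_km):
    """Average error distance in km; sensitive to a few very large errors."""
    return sum(errors_km) / len(errors_km)


def auc_of_log_errors(errors_km):
    """Trapezoidal area under the sorted log(error + 1) curve, normalized
    so that lower is better and the maximum is about 1."""
    logs = sorted(math.log(e + 1) for e in errors_km)
    if len(logs) < 2:
        raise ValueError("need at least two errors to form a curve")
    area = sum((a + b) / 2 for a, b in zip(logs, logs[1:]))
    return area / (math.log(MAX_ERROR_KM) * (len(logs) - 1))


errors = [0.0, 10.0, 100.0, 1000.0, 20000.0]  # made-up error distances in km
print(accuracy_at(errors), mean_error(errors), round(auc_of_log_errors(errors), 3))
```

In this made-up sample a single 20,000 km outlier pushes the mean error above 4,000 km, while the log-based AUC changes far less, which is why the two are reported side by side.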
We study the benefits of resolving toponyms over multiple levels to account for the range of populations, resolution ambiguity, topological shapes and sizes of different toponyms. We leave the shaping of the output space as future work (e.g., using geopolitical polygons instead of points).

Training
MLG is trained using TensorFlow (Abadi et al., 2016) distributed across 13 P100 GPUs. Each training batch processes 512 examples. The model trains for up to 1M steps, although it converges around 500K steps. We found an optimal initial learning rate of $10^{-4}$, decaying exponentially over batches after an initial warm-up. For optimization, we use Adam (Kingma and Ba, 2015) for stability.
We considered S2 levels 4 through 8, including single level (SLG) and multi-level (MLG) variations. MLG's architecture offers the flexibility of doing multi-level training but performing prediction with just one level. Based on the loss on Wikipedia development split, we chose multi-level training and prediction with levels 5, 6 and 7.
We stress that our focus is geocoding without gazetteer information at inference time. However, we also show that additional gains can be achieved by using gazetteers to select relevant cells for a given toponym and by scaling the output using the population bias (c), as described in Section 3.4.
Overall trends. The most striking result is MLG's improvement over CAMCODER without gazetteer filtering, especially on WikToR, a dataset specifically designed to counteract population priors. MLG clearly generalizes better by leaving out the non-lexical MapVec feature and thereby avoiding the influence of its population bias for the toponyms in the context. Fine-grained multi-level learning and prediction pays off, both with and without gazetteer filtering. This is particularly clear with AUC, where MLG is 6% better (averaged over all datasets) than CAMCODER with the gazetteer filter. Without the filter, MLG has an even larger gain of 9%.

Results
Generalization. When not using the gazetteer filter, MLG actually beats the population baseline for WikToR, and it is much closer to the strong population baselines for LGL and GeoVirus than CAMCODER and SLG are. This indicates that the multi-level approach allows the use of training evidence to generalize better over examples drawn globally (the entire world in GeoVirus) as well as locally (the United States of America in LGL).
Multi-level prediction helps. Table 3 compares the performance of using individual levels from the same MLG model trained on levels L5, L6 and L7 (without the gazetteer filter). The trade-off of predicting at different granularities is clear: when we use lower granularity, e.g. L5 cells, our model can generalize better, but it may be less precise given the large size of the cells. On the other hand, when using finer granularity, e.g. L7 cells, the model can be more accurate in dense regions, but could suffer in sparse regions where there is less training data. Combining the predictions from all levels balances the strengths effectively.

Levels five through seven offer the best trade-off. Table 4 shows the performance of MLG when training and predicting with multiple levels at different granularities. Overall, using levels five through seven (which has the best development split loss) provides the strongest balance between generalization and specificity. For locating cities, states and countries, especially when choosing from candidate locations in a gazetteer, L8 cells do not provide much greater precision than L7 and suffer from fewer examples as evidence in each cell.
Qualitative examples. An effective use of context in correctly predicting coordinates is shown in Table 5 on two examples, Arlington and Lincoln. In both pairs, the context helps to shift the predictions to the right regions on the map. The model is not biased toward just the most populous place. Here we show only a part of the context for clarity, though the actual context is longer (see Sec. 3.5).
Arlington is a former manor, village and civil parish in the North Devon district of Devon in England. The parish includes the villages of Arlington and Arlington Beccott. ...
Arlington is a city in Gilliam County, Oregon, United States. The account of how the city received its name varies; one tradition claims it was named after the lawyer Nathan Arlington Cornish, ...
Lincoln is a city in Logan County, Illinois, United States. It is the only town in the United States that was named for Abraham Lincoln before he became president....
Lincoln is a city in the province of Buenos Aires in Argentina. It is the capital of the district of Lincoln (Lincoln Partido). The district of Lincoln was established on ...
Table 6: Effect of ablating location features from the input to demonstrate their importance in MLG 5-7.
Figure 3: Ablating all toponyms at inference time spreads out the probabilities (points lighted up all over the map) but can still correctly predict Arlington (England) purely from context.
Ablations. Table 6 shows the ablation of salient features at inference time, removing either the target toponym or all toponyms. While masking the target toponym does not change results much except for GeoVirus, masking all other toponyms degrades performance considerably. Nevertheless, geolocation may still be possible with just the context words, which include other named entities, characteristics of the place, and, in a few cases, location-focused words. For example, Arlington (England) can be geolocated after all toponyms are masked (Fig. 3), though the distribution is more spread out in this case.

Conclusion and Future work
MLG uses multi-level optimization for the inherently hierarchical problem of geocoding. With just textual inputs, we can predict the location of a target toponym with minimal to no gazetteer metadata and outperform existing benchmark models. MLG can thus be used as a gazetteer-free geocoder on inputs like historical texts (DeLozier et al., 2016). Further, the models generalize very well across domains, and thus can be used on real-time data like news feeds. The multi-level loss can be further refined by using approaches like hierarchical softmax (Morin and Bengio, 2005) to incorporate the conditional probabilities across levels more effectively. A natural extension would be to fine-tune large pretrained language models for the geocoding task. We expect that the potential value of this is orthogonal to the contribution of our multi-level loss and the use of S2 cells. Another future direction involves smoothing the label space during training to capture the relations among spatially close cells by defining the loss as a function of Earth mover's distance, with approximations like Sinkhorn divergence. This would also enable shaping the output class space as polygons instead of points, which is more realistic for geographical regions.