Location Aware Modular Biencoder for Tourism Question Answering

Answering real-world tourism questions that seek Point-of-Interest (POI) recommendations is challenging, as it requires both spatial and non-spatial reasoning over a large candidate pool. The traditional method of encoding each question-POI pair becomes inefficient as the number of candidates increases, making it infeasible for real-world applications. To overcome this, we propose treating the QA task as a dense vector retrieval problem, where we encode questions and POIs separately and retrieve the most relevant POIs for a question via embedding-space similarity. We use pretrained language models (PLMs) to encode textual information, and train a location encoder to capture the spatial information of POIs. Experiments on a real-world tourism QA dataset demonstrate that our approach is effective, efficient, and outperforms previous methods across all metrics. Enabled by the dense retrieval architecture, we further build a global evaluation baseline, expanding the search space by 20 times compared to previous work. We also explore several factors that impact the model's performance through follow-up experiments. Our code and model are publicly available at https://github.com/haonan-li/LAMB.


Introduction
Question answering (QA) models and recommender systems have undergone rapid development in recent years (Seo et al., 2017; Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Lee et al., 2019; Cui et al., 2020; Hamid et al., 2021). However, personalised question answering is still highly challenging and relatively unexplored in the literature. Consider the example question in Figure 1, a real-world point-of-interest (POI) recommendation question from a travel forum. Answering such questions requires understanding of the question text with possibly explicit (e.g., in Dublin) or vague and ambiguous (e.g., within walking distance of Grafton Street) spatial constraints, as well as a fast indexing method that supports large-scale reasoning over both spatial and non-spatial (e.g., fairly priced restaurants) constraints.

[Figure 1: An example of a real-world POI recommendation question from the TourismQA dataset (Contractor et al., 2021b). Colored text represents constraints relevant to recommending POIs. Question: "Hi! My wife and I are in our late thirties and going to be in Dublin on September 28 and 29th. We are staying in the Grafton Street area. Does anybody have any suggestions for some fairly priced restaurants with great food within walking distance of Grafton Street? Also, what about some good pubs with live local music? (I realize it is a Sun and Mon night and may be slow) Any suggestion would be appreciated! Thanks!" Answer ID: 11_R_4392. Answer Name: The Porterhouse Central, 45-47 Nassau Street, Dublin.]
Recently, there has been increased interest in geospatial QA. Most approaches focus on querying structured knowledge bases, based on translating natural language questions into structured queries, e.g., using SPARQL (Punjani et al., 2018; Li et al., 2021; Hamzei et al., 2022). Separately, Contractor et al. (2021b) introduced the task of answering POI-seeking questions using geospatial metadata and reviews that describe POIs. In later work, they proposed a spatial-textual reasoning network that uses distance-aware question embeddings as input and encodes question-POI pairs using attention (Contractor et al., 2021a). However, as their model creates separate question embeddings for each POI, the inference cost increases linearly with the number of POIs, and the model is incompatible with large pre-trained models such as BERT (Devlin et al., 2019) or even medium-sized QA models such as BiDAF (Seo et al., 2017).
In this work, we address the question: can we build a more efficient POI recommendation system that supports the use of advanced pre-trained language models as the textual encoder? We answer this by presenting the Location Aware Modular Bi-encoder ("LAMB") model. We use a bi-encoder architecture to encode questions and POIs separately, where the question encoder is a textual module and the POI encoder consists of a textual and a location module. By encoding them separately, we cast the task as a retrieval problem based on dense vector similarity between the question and each POI. For training, we combine each question with one positively-labeled POI and multiple negatively-labeled POIs, and use contrastive learning to train the question encoder and POI encoder simultaneously, by maximizing the similarity between the question and the positive POI. After training, we generate location-aware dense representations for all POIs using the POI encoder, and index them by city name and entity (POI) type. For inference, we use the question encoder to generate a location-aware question representation, and rank the POIs by similarity.
Our contributions are four-fold: (1) we propose a location-aware modular bi-encoder model which fuses spatial and textual information; (2) we demonstrate that the proposed model outperforms the existing SOTA on a real-world tourism QA dataset, with huge improvements in training and inference efficiency; (3) we build new global evaluation baselines by expanding the search space 20× over local evaluation; and finally, (4) we analyse the influence of different training strategies and hyperparameters through extensive experiments.

Methodology
In this section, we first formulate the task, and then introduce the POI pre-processing method and the LAMB model. Finally, we describe the efficient training and inference strategies.

Task Formulation
Given a question q, the task is to find the most probable POI answer p from a candidate pool P, which satisfies the spatial and non-spatial constraints in q. Each POI in P consists of the geo-coordinates (lat, long) of the POI, a multi-granularity location name (POI entity name, street, city, postcode), and a list of textual reviews R = (r_1, r_2, ..., r_n). It can be represented as p = ⟨coordinates, name, reviews⟩ (see Appendix A for an example).
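To make the input format concrete, the sketch below shows one way to represent such a POI record in code; the field names and the example values are illustrative rather than the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class POI:
    """A candidate answer: geo-coordinates, multi-granularity name, reviews."""
    coordinates: Tuple[float, float]   # (lat, long), lat in [-90, 90], long in [-180, 180]
    name: Tuple[str, str, str, str]    # (entity name, street, city, postcode)
    reviews: List[str]                 # (r_1, ..., r_n)

# Hypothetical record in the spirit of Figure 1 (values are illustrative).
poi = POI(
    coordinates=(53.3434, -6.2563),
    name=("The Porterhouse Central", "45-47 Nassau Street", "Dublin", "D02"),
    reviews=["Great pints and live music.", "Reasonably priced food near Grafton Street."],
)
```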

POI Pre-processing
Reviews provide useful information for representing POIs; however, each candidate can have hundreds of reviews, whose total length greatly exceeds the maximum input length of 512 tokens in general PLMs such as BERT. To choose more representative reviews, previous work (Contractor et al., 2021b) clustered reviews into K clusters, and then represented the POI using the top-N sentences from each cluster based on distance from the cluster centroid, resulting in N × K sentences. However, this approach is potentially problematic, as clusters can be of varying size and density, and outliers can affect the centroid. To retain representative reviews, K and N should not be too small, e.g., Contractor et al. (2021b,a) set N = K = 10.
In this paper, we adopt the SELSUM (Bražinskas et al., 2021) model, which consists of a selector that chooses the M most representative reviews and a summarizer that generates a summary of the selected reviews. We use a model pre-trained on the AmaSum dataset, which includes verdicts, pros, cons, and hundreds of reviews for more than 31,000 summarized Amazon products (see the example in Appendix C). We compare the results using clustering, the selection module only, and the full SELSUM model in Appendix C. Our results show that using a 3-sentence summary for each POI achieves comparable results to a clustering approach that represents each POI via 100 sentences, and that using 10 sentences outperforms the clustering method.
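For reference, the clustering-based selection of prior work can be sketched as follows, assuming sentence embeddings have been pre-computed; the helper name and the use of scikit-learn are illustrative, not the original pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_sentences(sentences, embeddings, K=10, N=10):
    """Cluster review sentences into K clusters and keep the N sentences
    closest to each centroid, yielding at most N * K sentences per POI."""
    kmeans = KMeans(n_clusters=K, n_init=10).fit(embeddings)
    selected = []
    for c in range(K):
        idx = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
        selected.extend(sentences[i] for i in idx[np.argsort(dists)[:N]])
    return selected
```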

Location Aware Modular Bi-encoder
LAMB (see Figure 2) uses a bi-encoder framework to encode questions and POIs. The question encoder is a textual module which takes the question text as input and outputs a dense representation. The POI encoder consists of a textual module and a location module, where the textual module encodes the description and/or reviews associated with the POI, and the location module encodes its multi-granularity location name. The outputs of the textual and location modules are real-valued vectors, which are concatenated to represent a POI. Full details of the model are presented below.

Textual Module: We use two independent PLMs as the textual encoders for questions and POIs, using the [CLS] token representation as the output. For questions, we do not preprocess the question text, while for POIs, we concatenate the preprocessed reviews.
Location Module: Spatial constraints are crucial in retrieving POIs relevant to a question. However, previous research has shown that PLMs perform poorly in encoding and reasoning over spatial data, especially geolocation information (Scherrer and Ljubešić, 2021; Hofmann et al., 2022). To enhance the model's ability to capture geospatial information, we employ a location module that explicitly encodes the multi-granularity location name of a POI into a dense vector. We initialize the location module with several transformer blocks from a PLM, and continue pre-training it to learn geo-coordinate-aware location name representations. The training objective is designed to pull together pairs of encoded location representations if the locations are physically near each other, and push them apart if they are far from each other. Formally, for any three POIs (p_0, p_1, p_2), suppose the corresponding locations are (l_0, l_1, l_2), and the encoded representations are (h_0, h_1, h_2), where l_i = (lat_i, long_i) represents the latitude and longitude of p_i, with lat_i ∈ [−90, 90] and long_i ∈ [−180, 180], and h_i is a vector. We choose p_0 to be an anchor location, and d_i (i = 1, 2) ∈ [0, 1] to be the normalized Haversine distance between l_0 and l_i, i.e., the great-circle distance between two points on a sphere. Similarly, s_i (i = 1, 2) ∈ [0, 1] is the cosine similarity between h_0 and h_i. We use a triplet margin loss with the distance gap as a dynamic margin:

L(p_0, p_1, p_2) = max(0, s_1 − s_2 + (d_1 − d_2))   if d_1 − d_2 > 0
L(p_0, p_1, p_2) = max(0, s_2 − s_1 + (d_2 − d_1))   otherwise

In the first case, d_1 − d_2 > 0 means that p_2 is closer to p_0 than p_1 is, and hence we structure the loss to learn a larger s_2 (= higher similarity between p_0 and p_2) and a smaller s_1 (= lower similarity between p_0 and p_1). We set the difference between the two distances as a dynamic margin, which controls the desired similarity difference.
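A minimal PyTorch sketch of this objective follows; the symmetric case handling matches the description above, and the function name is illustrative.

```python
import torch

def dynamic_margin_triplet_loss(s1, s2, d1, d2):
    """Triplet loss with |d1 - d2| as a dynamic margin (sketch).

    s_i: cosine similarity between anchor h_0 and h_i, in [0, 1]
    d_i: normalized Haversine distance between l_0 and l_i, in [0, 1]
    If d1 > d2 (p_2 is closer to the anchor), push s2 above s1 by at
    least the distance gap, and symmetrically otherwise.
    """
    margin = (d1 - d2).abs()
    # similarity of the farther POI minus similarity of the closer POI
    gap = torch.where(d1 > d2, s1 - s2, s2 - s1)
    return torch.clamp(gap + margin, min=0).mean()
```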
Question and POI Encoders: As mentioned above, we use a separate textual encoding module E_P^text and location encoding module E_P^loc to encode each POI. These modules map the review text and the location name to fixed-length vectors:

r_p^text = E_P^text(reviews_p),  r_p^loc = E_P^loc(name_p)

We concatenate r_p^text and r_p^loc and then use a dense layer to fuse the representations together, resulting in the POI representation r_p ∈ R^{1×d}:

r_p = Dense([r_p^text ; r_p^loc])

For questions, we similarly tried using separate text and location modules and combining their outputs. However, we found that the text may contain distractor locations that should not be treated as spatial constraints, and that context is essential (e.g., the place name Italy in the question "Hey I am from Italy, please suggest a restaurant in Berlin that suits my appetite."). Hence, we use a single textual module E_Q^text which directly maps the question text into a representation r_q ∈ R^{1×d}, of the same dimension as a POI.
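A simplified sketch of the POI encoder follows, using off-the-shelf DistilBERT as a stand-in for both modules; in the actual model, the location module is the smaller two-block encoder pre-trained as described above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class POIEncoder(nn.Module):
    """Textual module + location module, fused by a dense layer (sketch)."""
    def __init__(self, d=768):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        # Stand-in: the actual location module is two DistilBERT blocks
        # further pre-trained with the dynamic-margin triplet objective.
        self.loc_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, review_inputs, name_inputs):
        r_text = self.text_encoder(**review_inputs).last_hidden_state[:, 0]  # [CLS]
        r_loc = self.loc_encoder(**name_inputs).last_hidden_state[:, 0]      # [CLS]
        return self.fuse(torch.cat([r_text, r_loc], dim=-1))  # r_p of dimension d

# The question encoder is a single textual module: r_q = [CLS] of E_Q^text(question).
```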

Training and Inference
We train the two encoders simultaneously using contrastive learning. We input each question q_i with one positive POI p_i^+ and several negative POIs p_{i,1}^-, ..., p_{i,n}^- into the model, with the objective of maximizing the similarity between the embeddings of q_i and p_i^+, while minimizing the similarity between the embeddings of q_i and p_{i,1}^-, ..., p_{i,n}^-. We use the negative log-likelihood (NLL) loss of the positive POI as our objective function:

L(q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^-) = −log [ e^{sim(q_i, p_i^+)} / ( e^{sim(q_i, p_i^+)} + Σ_{j=1}^{n} e^{sim(q_i, p_{i,j}^-)} ) ]

where the similarity function sim(q, p) is the inner product between the question and POI embeddings.
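Since the softmax normalizes over the positive and all negatives, this loss can be computed as a cross-entropy whose gold index points at the positive POI; a minimal sketch, assuming pre-computed embeddings:

```python
import torch
import torch.nn.functional as F

def contrastive_nll(q_emb, pos_emb, neg_embs):
    """NLL of the positive POI under a softmax over inner-product scores.

    q_emb:    (d,)    question embedding
    pos_emb:  (d,)    positive POI embedding
    neg_embs: (n, d)  negative POI embeddings
    """
    cands = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (n+1, d)
    scores = cands @ q_emb                                      # inner-product similarities
    # The positive POI sits at index 0 of the candidate list.
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```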
Negative Sampling Strategy: A critical question in contrastive learning is how to construct positive and negative examples. In our case, for each question, there can be more than one answer (= positive) POI. To make use of every positive POI, as well as to adapt to the NLL loss function, we create a training example for each positive POI. For negative samples, all non-answer POIs are candidates, but previous work (Karpukhin et al., 2020; Xiong et al., 2021a) has shown that high-quality negative samples help to learn a better encoder. In this research, we consider three different types of negative samples: (1) easy negatives = random (non-answer) POIs from the entire candidate set; (2) medium negatives = random (non-answer) POIs that are in the same city and of the same type (restaurant, attraction, or hotel) as the answer POI; and (3) hard negatives = top-k ranked non-answer POIs from the previous epoch.

Two-phase Training: We conduct two phases of training: first, we use easy and medium negatives to warm up the model, providing it with a relatively easily-optimizable objective; next, we switch over to training with a mixture of medium and hard negatives. We sample hard negatives by performing inference on the training data after each epoch (or a specific number of steps) to find the top-k POIs for each training question. We then create new training instances by randomly sampling N non-answer POIs from the top-k retrieved POIs, and use these to continue training the model.

Inference: Before inference, we use the POI encoder to generate representations of all POIs, and store and index them (as shown in the orange part of Figure 2). During inference, the generated POI representations are loaded into memory. Given a question q at run-time, we encode it using the question encoder, score all candidates using the pre-computed representations, and return the top-k results.
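Because all POI representations are pre-computed, run-time scoring reduces to a single matrix-vector product followed by a top-k selection; a minimal sketch:

```python
import torch

def retrieve_topk(q_emb, poi_matrix, k=30):
    """Score all candidates against a question and return the top-k indices.

    q_emb:      (d,)    question embedding from the question encoder
    poi_matrix: (P, d)  pre-computed, indexed POI embeddings
    """
    scores = poi_matrix @ q_emb        # inner-product similarity to every POI
    return torch.topk(scores, k).indices
```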

Experimental Setup
In this section, we introduce the dataset, baselines, and implementation details of our model.

Dataset
We use the TourismQA (Contractor et al., 2021b) dataset, which comprises over 47,000 real-world POI question-answer pairs from 50 cities across the globe. The questions are genuine queries submitted to a trip advisor website, and the answers are real-world responses that have been chosen and authenticated by annotators. The average length of the questions is 87.48 whitespace-separated tokens, and on average there are 3.63 ground-truth answer POIs per question.
The dataset contains roughly 114,000 candidate POIs altogether, each with a collection of reviews and metadata such as geo-coordinates and type (restaurant, attraction, or hotel). We follow Contractor et al. (2021b) in dividing the dataset into a 9:1 train-test split, and constructing a search space comprising the POIs located in the same city as the ground-truth POIs, resulting in an average of approximately 5,300 candidate POIs per question. We believe one reason earlier work restricted the candidate pool to a single city is that their methods struggled with a large candidate pool. However, in real-world scenarios, the ground-truth answer is concealed, and the candidate pool may be extensive, encompassing all POIs in the database. Therefore, we establish a new evaluation setting in which the search space comprises all POIs in the world. We refer to this new setting as global evaluation (114,000 candidates), and the previous one as local evaluation (5,300 candidates).

Evaluation Metrics
Following Contractor et al. (2021b), we evaluate using Accuracy@N for N ∈ {3, 5, 30} and mean reciprocal rank (MRR) for local evaluation, and Accuracy@N for N ∈ {5, 30, 100} for global evaluation. For Accuracy@N, a prediction is considered correct if the top-N predictions have a non-empty intersection with the answer POI set. For MRR, we take the reciprocal rank of the first positive answer POI per question, averaged over questions.
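Both metrics are straightforward to compute from a ranked list of POI IDs; a minimal sketch:

```python
def accuracy_at_n(ranked_ids, answer_ids, n):
    """1 if any of the top-n predictions is a ground-truth POI, else 0."""
    return int(bool(set(ranked_ids[:n]) & set(answer_ids)))

def reciprocal_rank(ranked_ids, answer_ids):
    """Reciprocal rank of the first ground-truth POI (0 if none retrieved)."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in answer_ids:
            return 1.0 / rank
    return 0.0
```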

Baselines
We compare against four baselines, as detailed below.

Sort by Distance (SD): Given all tagged locations with geo-coordinates in the question, we rank POIs by the minimal distance from the tagged locations.

BM25: We represent each POI by its combined reviews, and index them using Apache Lucene. Questions are then used as queries to compute BM25 scores for all POIs.

Cluster-Select-Rerank ("CSR") (Contractor et al., 2021b): This model consists of three components: (1) a clustering module that clusters reviews for each POI and selects representative reviews; (2) a Duet (Mitra and Craswell, 2019) retrieval model that selects the best 30 candidate POIs; and (3) a QA-style re-ranker that scores and re-ranks the selected POIs. Note that the clustering module is used to pre-process the POIs, and the selection and re-ranking modules are trained separately and pipelined.

Spatial-Textual CSR (Contractor et al., 2021a): This model adds a self-attention based geospatial reasoner to the CSR model, and ranks POIs based on the weighted sum of scores from the geospatial reasoner and CSR.
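For illustration, the BM25 baseline can be sketched in a few lines using the rank_bm25 package in place of Lucene; the package choice and variable names are ours, not those of the original pipeline.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Assumed input: poi_review_texts, one concatenated review string per POI.
corpus = [reviews.lower().split() for reviews in poi_review_texts]
bm25 = BM25Okapi(corpus)

def bm25_rank(question, k=30):
    """Score every POI against the question text and return top-k indices."""
    scores = bm25.get_scores(question.lower().split())
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```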

LAMB Implementation Details
We implement our model in PyTorch, and use the HuggingFace (Wolf et al., 2020) implementation of DistilBERT (Sanh et al., 2019) as the textual encoder. The location module comprises two transformer blocks, initialized with the first two blocks of a pre-trained DistilBERT model. We continue pre-training for 3 epochs using the triplet loss to force the model to learn more spatial information, as described in Section 2.3. During this process, we set the batch size to 8, the learning rate to 2e-5, and the maximum sequence length to 64.
For the main LAMB model, the maximum length (in subtokens) for both questions and reviews is set to 256. For training, we use a linear learning rate scheduler with an initial learning rate of 2e-5, and the Adam optimizer with default hyperparameters. For each training instance, we use a single positive POI and varying numbers of negatives. We set the batch size to 8 and train for 10 epochs: 5 epochs of phase 1 (easy and medium negatives), and 5 epochs of phase 2 (medium and hard negatives). All experiments were run on a single Nvidia A100 40GB GPU for about 8 hours.

Results and Analysis
Table 1 shows the overall performance of the baselines and our proposed model. The sparse-vector retrieval (BM25) and distance-based retrieval (SD) models in the first block of the table perform extremely poorly, demonstrating the difficulty of the task. In contrast, the textual-only pipelined models (CRQA and CSRQA) in the second block improve overall performance substantially, and adding the spatial reasoning sub-network ("ST+") boosts results again. Note that, since CSRQA is pipelined with a selection model that selects the top-30 results, the spatial-textual module cannot improve Accuracy@30 further.
Compared to the baselines in blocks one and two, our model, LAMB, achieves the state-of-the-art across all metrics. To better understand the impact of different components of our model, we conducted an ablation study by separately removing the training phase 2, the review selection and summarization modules, and the location module. Overall, performance dropped when any of these modules or strategies was removed, but still outperformed the previous state-of-the-art. Specifically, removing training phase 2 had a relatively large impact on local evaluation, which we attribute to the process of training to distinguish hard negatives. Removing the location module greatly impacted the global evaluation, demonstrating the effectiveness of the location module, particularly when candidates are drawn from around the globe.
Based on our analysis, there are three main reasons why LAMB outperforms previous models: (1) training and inference are end-to-end, avoiding the error propagation caused by pipelining, as with CSRQA; (2) we use pre-trained language models as the textual encoder, outperforming static word embeddings or encoders trained from scratch; and (3) we learn location encodings separately and fuse them with textual representations, providing a soft distance computation. We provide a comparison between our location module design and other straightforward geo-coordinate-based location/distance modules in Appendix E. From this, we conclude that compared to strategies that encode geo-coordinates directly, a pre-trained location-name module better captures spatial information.

Efficiency Comparison
We analyze the computational requirements of the models in Table 2. LAMB is more time-efficient than the previously-proposed neural models, requiring around 5% of the training time and less than 10% of the inference time. It is also able to handle a much larger candidate pool (in the millions of candidates) compared to C(±S)RQA (in the tens of thousands of candidates). Further analysis of efficiency and usability is provided in Appendix D.

Ablation Study on Model Training
To further understand how different model training options affect the results, we conduct several additional experiments and discuss our findings below.
Location Module Analysis: We compare various settings of the location module, as shown in Table 3. The table indicates that continued pre-training of a PLM on location names significantly enhances the module's ability to capture geo-location and distance. Furthermore, using two transformer blocks is sufficient to encode multi-granularity location names, whereas more or fewer layers may lead to overfitting or underfitting.

Negative Sampling Analysis: We keep the total number of negatives constant at 15 while varying the mix of easy and hard negatives (as presented in Table 4). As we increase the number of hard negatives, the global evaluation results deteriorate while the local evaluation results improve. This implies that training with easy negatives is more appropriate when the target city or area is unconstrained. The best local evaluation results were achieved when using 12/15 hard negatives, indicating that easy negatives are still necessary for learning general location constraints. We further investigated varying the total number of negatives for contrastive learning, as presented in Table 7 in the Appendix. Our findings indicate that the more negatives we have in each training instance, the better the model performs, but the relative improvement plateaus beyond around 30.

Two-Phase Training Strategy
We conducted experiments with different epoch configurations for our two-phase training strategy, as detailed in Table 5. Our results indicate that both phase 1 and phase 2 are essential, aligning with the assumptions stated in Section 2.4. Furthermore, we found that commencing phase 2 training at the midway point was particularly effective.

Human Evaluation
To further investigate the dataset and get a better sense of the overall performance of LAMB, we conducted a small-scale human evaluation. We randomly chose 100 questions from the test set and manually evaluated the relevance of the top-3 predictions of LAMB, as presented in Table 1. For this small question set, our estimate of the true Accuracy@3 is around 75%, compared to the automatic evaluation result of 24%. This is consistent with the human evaluation results reported by Contractor et al. (2021a), and points to the issue of low label recall in the dataset: while a given POI may not have been selected by the user who issued the original question, it may well have satisfied the constraints described in the question.

How ChatGPT Performs on TourismQA
During the writing of this paper, ChatGPT (i.e., GPT-3.5) was released. We manually tested the 100 questions from Section 4.3 by inputting them directly into ChatGPT (GPT-3.5-turbo on 20 March 2023) and taking a single response. Out of the 100 questions, 91 received recommendations for points of interest or areas. However, only 14 of those replies matched the ground-truth answers, which is lower than our model's 24. We believe the main reason for this discrepancy is differences in the POI databases. The replies from ChatGPT were well-organized and logical, and could even address many details in the questions beyond the capabilities of our model. However, we observed that ChatGPT failed to provide an output in many cases: among the 100 replies, sentences such as "As an AI language model, I don't have personal experience in ..." appeared 36 times, while other outputs like "I can recommend that you check out the reviews on websites like TripAdvisor or Booking.com" appeared 13 times. Additionally, ChatGPT tended to recommend popular places, with the word popular appearing 44 times in replies, despite not being mentioned in any of the questions. We observed further bias in ChatGPT's recommendations: for example, it recommended Shake Shack nine times in response to fast food requests, but never mentioned other international fast-food chains or local chains, even when questions specifically asked for fast food with regional characteristics.
Lastly, ChatGPT's database is not up-to-date, as also acknowledged in its replies. Since OpenAI has not provided full training details, the cost of updating the database, including fine-tuning the model, is unclear. In summary, there is still a real need for a comprehensive recommendation system that can be combined with up-to-date website information.
Related Work

Geospatial Question Answering: Based on the type of question, existing work on geospatial QA ("GeoQA") can be classified into four types (Mai et al., 2021): (1) factoid GeoQA (Li et al., 2021; Hamzei et al., 2022), focusing on answering questions with geographic factoids; (2) geo-analytical QA (Scheider et al., 2020; Xu et al., 2020), focusing on questions with complex spatial analytical intent; (3) visual GeoQA (Lobry et al., 2020; Janowicz et al., 2020), linking questions to an image or video; and (4) scenario-based GeoQA (Huang et al., 2019; Contractor et al., 2021b), which associates questions with a scenario described by a map or a paragraph of text. Our work corresponds to the last type, and unlike most other work, we do not rely on task-specific query languages or annotations, and focus more on NLP and IR modeling.
Point-of-Interest (POI) Recommendation: POI recommendation systems have a wide range of applications, such as online navigation (Zhao et al., 2019a; Yuan et al., 2021), personalized recommendation in location-based social networks (Feng et al., 2015; Zhao et al., 2019b), and trip or accommodation advisory systems (Li et al., 2016; Contractor et al., 2021b). In this research, we focus on POI recommendation incorporating both structured information (such as geo-coordinates) and unstructured information (such as textual descriptions). Previous work has explored efficient spatial indexing based on specialized data structures, with textual information as sparse vectors or filters (de Almeida and Rocha-Junior, 2015; Li et al., 2016). Recent work (Contractor et al., 2021b,a) has focused on latent textual representations, which is highly relevant here.
Textual Encoding and Document Retrieval: Pre-trained language models (PLMs) have led to great successes across many NLP tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Lewis et al., 2020; He et al., 2021; Clark et al., 2020). In the field of QA, PLMs have been used to generate representations of questions and documents (Nogueira et al., 2019; Zhang et al., 2020). In this work, we use DistilBERT (Sanh et al., 2019) as our textual encoder, as it is more efficient than BERT while retaining much of its expressivity.
Document retrieval has become a mainstay of research in IR and QA. Recently, IR has increasingly moved towards dense vector retrieval methods (Das et al., 2019; Seo et al., 2019; Xiong et al., 2021b). In particular, Karpukhin et al. (2020) proposed DPR, based on a dual-encoder approach, and attained impressive results on multiple open-domain question answering benchmarks. Inspired by this, we adopt a bi-encoder framework.

Conclusion
We have proposed LAMB, a location-aware bi-encoder model for answering POI recommendation questions. Experiments on a recently-released tourism question answering dataset show that our model surpasses existing spatial-textual reasoning models across all metrics. Ablations over LAMB's components and variations of the training strategy show the effectiveness of the different design choices used in LAMB. Finally, we analyzed the training and inference efficiency, and demonstrated that our model is resource-efficient at both training and inference time, suggesting it can be deployed in real-world tourism applications.

Limitations
Although we have achieved results that significantly outperform the current state-of-the-art, our work still has some limitations. First, as demonstrated in Section 4.3 and in the earlier work of Contractor et al. (2021a), the TourismQA dataset was collected semi-automatically, and the gold labels have high precision but low recall. Hence, any results on this dataset are likely an underestimate of true model performance. Second, while we currently use the Haversine formula to compute the distance between two locations and supervise the pre-training of the location module, this calculation may not reflect the actual travel distance between two places, which depends on route directions and vertical height differences. Depending on a city's urban design, the Manhattan distance might better represent the true distance between two locations within a city. Additionally, POI density could be a factor that influences user choice in real life: people may be more inclined to eat at locations with a higher density of restaurants (in order to have more options if a given restaurant doesn't live up to their expectations), rather than travel far to a remote place without other options in the local vicinity. For hotels, on the other hand, some users may prefer privacy and a lower density. Such extra-linguistic features are not explicitly captured in our model.

A POI Example
Table 6 shows a POI example, from which we can see that many reviews have similar semantics, making it important to choose representative reviews. In this work, we cluster sentences from reviews, and choose sentences evenly from each cluster to make up the textual input.

C SELSUM Example and Effectiveness
Figure 3 shows an example of SELSUM model output. Table 8 presents a comparison of using clustered reviews, selected reviews (from SELSUM), and summarized reviews.

D Efficiency and Usability Analysis
The most important component of LAMB is the textual encoder, which can be replaced by any pre-trained language model. With the continued development of model distillation and compression methods (Jiao et al., 2019; Wang et al., 2020b; Sun et al., 2020), the encoder can be made progressively lighter. For retrieval, efficient similarity search libraries such as FAISS (Johnson et al., 2021) can be used to achieve sub-linear search times.

Training and Update: The training of LAMB takes no more than 12 hours on a single GPU. Figure 4 shows the top-k retrieval accuracy with respect to the number of training epochs, from which we can see that the model already achieves good results after 5 epochs. Once trained, there is no need to retrain the model from scratch: as more and more new questions and POIs appear, it should be enough to fine-tune the model on the new questions and POIs for one or two additional epochs to maintain high performance.

E Comparison to Geo-coordinate-based Location/Distance Module
We compare our location module with straightforward geo-coordinate-based location and distance modules. Specifically, during question pre-processing, we detect location mentions and tag them with geo-coordinates using a geo-tagger. Similar to LAMB, the question location module E_Q^loc maps the geo-coordinates of the mentioned locations into a fixed-length vector:

r_q^loc = E_Q^loc([l_1, l_2, ..., l_m]) ∈ R^{1×d_2}

where m is a hyper-parameter determined based on the average number of location mentions in questions (m = 5 here), and each l_i is a 2-d vector [lat_i, long_i]. If a question contains n > m unique locations, we randomly select m locations as the input to E_Q^loc; otherwise, we pad the input to m with [0, 0]. Note that the output dimension d_2 is fixed and independent of the number of locations n. For POIs, we simply set m = 1.

Location Module: The location modules for both questions and POIs are implemented as multilayer perceptrons. Since multiple location mentions (geo-coordinates) may exist in a given question while each POI has a unique geolocation, the input sizes of the two location modules differ slightly: POIs are represented as [lat, long] (size = 2), while questions are represented as [lat_1, long_1, lat_2, long_2, ..., lat_m, long_m] (size = 2m). We use a 3-layer MLP with dropout of 0.2 and a ReLU activation function to map locations into a 2m-d vector (i.e., d_2 = 2m).
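A minimal sketch of this MLP variant follows; the hidden layer widths are an assumption, as the text only fixes the depth, dropout, activation, and input/output sizes.

```python
import torch.nn as nn

def coordinate_mlp(m=5):
    """3-layer MLP mapping m (lat, long) pairs to a 2m-d vector (d_2 = 2m).
    For POIs, m = 1; questions pad or sample to m location mentions."""
    d = 2 * m  # input and output size; hidden widths assumed equal to d
    return nn.Sequential(
        nn.Linear(d, d), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(d, d), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(d, d),
    )
```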
Distance Module: Since the location module indiscriminately encodes location mentions from the question into a fixed-length vector, some of which may be irrelevant or even harmful for POI matching, we add a distance module that explicitly computes a distance score from each location mention in the question to a POI, followed by min-pooling to choose the minimal distance from the question to the given POI. We use the Haversine formula to compute distances.

To use the distance module, we define the similarity between a question q and a POI p as the weighted sum of the bi-encoder similarity score and the distance score:

sim(p, q) = (1 − λ) sim(r_p, r_q) − λ dist(p, q)

We negate the distance score to ensure that closer POIs receive higher scores.
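A minimal sketch of the distance score and the combined similarity; the weight value and the kilometre scale are illustrative, and in practice the distance would be normalized before mixing.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def combined_similarity(biencoder_sim, question_locs, poi_loc, lam=0.5):
    """sim(p, q) = (1 - lambda) * sim(r_p, r_q) - lambda * dist(p, q),
    where dist is min-pooled over the question's location mentions."""
    dist = min(haversine_km(*loc, *poi_loc) for loc in question_locs)
    return (1 - lam) * biencoder_sim - lam * dist
```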

Table 1: Overall evaluation on the TourismQA dataset. The second block of results is based on the TourismQA paper, wherein the best results are underlined, and "ST" denotes the spatial-textual module. The overall best results are in bold. The third block presents the results for the full LAMB model, and also with module ablations.

Table 2: Runtime comparison, based on a single Nvidia V100 GPU. "#Cand" indicates the number of candidate POIs. For CSRQA, time was estimated by summing the times of the component models.

Table 3: Results with different location module settings.

Table 4: Results with differing numbers of easy/hard negatives, with total negatives = 15. #HN: number of hard negatives.

Table 5: Results with varied epoch allocations in two-phase training, using 10 total training epochs.

Table 6: A POI example, where reviews have been segmented into sentences.

Table 7: Results with differing numbers of total negatives, with around 3/4 hard negatives. Lines with * signify results with early stopping, because using only hard negatives collapsed the model.

Table 8: Comparison of using clustered reviews, selected reviews with SELSUM, and summarized reviews with SELSUM.

Table 9: Comparison between the LAMB location module and other geo-coordinate-based location/distance modules on local evaluation.