Point-of-Interest Type Prediction using Text and Images

Point-of-interest (POI) type prediction is the task of inferring the type of the place from which a social media post was shared. Inferring a POI's type is useful for studies in computational social science including sociolinguistics, geosemiotics, and cultural geography, and has applications in geosocial networking technologies such as recommendation and visualization systems. Prior efforts in POI type prediction focus solely on text, without taking visual information into account. However, in reality, the variety of modalities, as well as their semiotic relationships with one another, shape communication and interactions in social media. This paper presents a study on POI type prediction using multimodal information from text and images available at posting time. For that purpose, we enrich a currently available data set for POI type prediction with the images that accompany the text messages. Our proposed method extracts relevant information from each modality and effectively captures interactions between text and image, achieving a macro F1 of 47.21 across eight categories and significantly outperforming the state-of-the-art text-only method for POI type prediction. Finally, we provide a detailed analysis to shed light on cross-modal interactions and the limitations of our best performing model.


Introduction
A place is typically described as a physical space infused with human meaning and experiences that facilitate communication (Tuan, 1977). The multimodal content of social media posts (e.g. text, images, emojis) generated by users from specific places such as restaurants, shops, and parks contributes to shaping a place's identity, by offering information about feelings elicited by participating in an activity or living an experience in that place (Tanasescu et al., 2013). Fig. 1 shows examples of Twitter posts consisting of image-text pairs, shared from two different places or Points-of-Interest (POIs). Users share content that is relevant to their experience in the location. For example, the text imagine all the people sharing all the world is accompanied by a photograph of the Imagine Mosaic in Central Park, while the text Next stop: NYC appears along with a picture of characteristic items that people carry at an airport, such as luggage, a camera, and a takeaway coffee cup.

[Figure 1: Example of text and image content of sample tweets ("imagine all the people sharing all the world"; "Next stop: NYC"). Users share content that is relevant to their experiences and feelings in the location.]

1 Code and data are available here: https://github.com/danaesavi/poi-type-prediction
Developing computational methods to infer the type of a POI from social media posts (Liu et al., 2012; Sánchez Villegas et al., 2020) is useful for complementing studies in computational social science including sociolinguistics, geosemiotics, and cultural geography (Kress et al., 1996; Scollon and Scollon, 2003; Al Zydjaly, 2014), and has applications in geosocial networking technologies such as recommendation and visualization systems (Alazzawi et al., 2012; Zhang and Cheng, 2018; van Weerdenburg et al., 2019; Liu et al., 2020b).
Previous work in natural language processing (NLP) has investigated the language that people use in social media from different locations, inferring the POI type of a given social media post using only text and posting time, ignoring the visual context (Sánchez Villegas et al., 2020). However, communication and interactions in social media are naturally shaped by the variety of available modalities and their semiotic relationships (i.e. how meaning is created and communicated) with one another (Georgakopoulou and Spilioti, 2015; Kruk et al., 2019; Vempala and Preoţiuc-Pietro, 2019).
In this paper, we propose POI type prediction using multimodal content available at posting time, taking into account both textual and visual information. Our contributions are as follows:
• We enrich a publicly available data set of social media posts and POI types with images;
• We propose a multimodal model that combines text and images at two levels using: (i) a modality gate to control the amount of information needed from the text and image; (ii) a cross-attention mechanism to learn cross-modal interactions. Our model significantly outperforms the best state-of-the-art method proposed by Sánchez Villegas et al. (2020);
• We provide an in-depth analysis to uncover the limitations of our model and the cross-modal characteristics of POI types.
Related Work

POI Analysis
POIs have been studied to classify functional regions (e.g. residential, business, and transportation areas) and to analyze activity patterns using social media check-in data and geo-referenced images (Zhi et al., 2016; Liu et al., 2020a; Zhou et al., 2020a; Zhang et al., 2020). Zhou et al. (2020a) present a model for classifying POI function types (e.g. bank, entertainment, culture) using POI names and a list of results produced by searching for the POI name in a web search engine. Zhang et al. (2020) make use of social media check-ins and street-level images to compare the different activity patterns of visitors and locals, and to uncover inconspicuous but interesting places in a city. A framework for extracting emotions (e.g. joy, happiness) from photos taken at various locations in social media is described in Kang et al. (2019).

POI Type Prediction
POI type prediction is related to geolocation prediction of social media posts, which has been widely studied in NLP (Eisenstein et al., 2010; Roller et al., 2012; Dredze et al., 2016). However, while geolocation prediction aims to infer the exact geographical location of a post using language variation and geographical cues, POI type prediction is focused on identifying the characteristics associated with each type of place, regardless of its geographic location. Previous work on POI type prediction from social media content has used Twitter posts (text and posting time) to identify the POI type from which a post was sent (Liu et al., 2012; Sánchez Villegas et al., 2020). Liu et al. (2012) incorporate text, temporal features (posting hour) and user history information into probabilistic text classification models. Rather than a user-based study, our research aims to uncover the characteristics associated with various types of POIs. Sánchez Villegas et al. (2020) analyze semantic place information of different types of POIs using the text and temporal information (hour, and day of the week) of a Twitter post. To the best of our knowledge, this is the first study to combine textual and visual features to classify POI types (e.g. arts & entertainment, nightlife spot) from social media messages, regardless of their geographic location.

Social Media Analysis using Text and Images
The combination of text and images in social media posts has been widely used for different applications such as sentiment analysis (Nguyen and Shirai, 2015; Chambers et al., 2015), sarcasm detection (Cai et al., 2019), text-image relation classification (Vempala and Preoţiuc-Pietro, 2019; Kruk et al., 2019), and named entity recognition (Moon et al., 2018b). Social media analysis research has also looked at the semiotic properties of text-image pairs in posts (Alikhani et al., 2019; Vempala and Preoţiuc-Pietro, 2019; Kruk et al., 2019). Vempala and Preoţiuc-Pietro (2019) investigate the relationship between text and image content by identifying posts with overlapping meaning in both modalities, posts where one modality contributes additional details, and cases where each modality contributes different information. Kruk et al. (2019) analyze the relationship between text-image pairs and find that when the image and caption diverge semiotically, the benefit from multimodal modeling is greater.

Task & Data
Sánchez Villegas et al. (2020) define POI type prediction as a multi-class classification task where, given the text content of a post, the goal is to classify it into one of $M$ POI categories. In this work, we extend this task definition to include images in order to capture the semiotic relationships between the two modalities. For that purpose, we consider a social media post $P$ (e.g. a tweet) to comprise a text-image pair $(x_t, x_v)$, where $x_t \in \mathbb{R}^{d_t}$ and $x_v \in \mathbb{R}^{d_v}$ are the textual and visual vector representations respectively.

POI Data
We use the data set introduced by Sánchez Villegas et al. (2020), which contains 196,235 tweets written in English, each labeled with one of the eight broad POI type categories shown in Table 1. These correspond to the eight primary top-level POI categories in 'Places by Foursquare', a database of over 105 million POIs worldwide managed by Foursquare. To generalize to locations not present in the training set, we use the same location-level data splits (train, dev, test) as in Sánchez Villegas et al. (2020), where each split contains tweets from different locations.

Image Collection
We use the Twitter API to collect the images that accompany each textual post in the data set. For tweets that have more than one image, we select only the first available. This results in 91,224 tweets with at least one image. During image processing (see Section 5.3) we removed 129 images because they were either damaged, absent, or contained no detected objects, resulting in 91,095 text-image pairs (see Table 1 for data statistics). To deal with the rest of the tweets with no associated image, we pair them with a single 'average' image computed over all images in the train set. The intuition behind this approach is to generate a 'noisy' image that is not related to the text and does not add to its meaning (Vempala and Preoţiuc-Pietro, 2019).
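The 'average' image is a pixel-wise mean over the training images; below is a minimal sketch of one plausible implementation (function and variable names are ours, not from the original code):

```python
import numpy as np
from PIL import Image

def compute_average_image(image_paths, size=(299, 299)):
    """Pixel-wise mean over all training images; used as a 'noisy'
    placeholder image for tweets with no attached image."""
    acc = np.zeros((size[1], size[0], 3), dtype=np.float64)
    for path in image_paths:
        img = Image.open(path).convert("RGB").resize(size)
        acc += np.asarray(img, dtype=np.float64)
    acc /= len(image_paths)
    return Image.fromarray(acc.astype(np.uint8))

# Usage (train_image_paths is a hypothetical list of file paths):
# average_image = compute_average_image(train_image_paths)
# average_image.save("average_train_image.png")
```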

Exploratory Analysis of Image Data
To shed light on the characteristics of the collected images, we apply object detection to the images (Krishna et al., 2017; Anderson et al., 2018). Table 2 shows the most common objects for each category. We observe that most objects are related to items one would find in each place category (e.g. 'spoon', 'meat', 'knife' in Food). Clothing items are common across category types (e.g. 'shirt', 'jacket', 'pants'), suggesting the presence of people in the images. A common object tag of the Shop & Service category is 'letters', which concerns images that contain embedded text. Finally, the category Great Outdoors includes object tags such as 'cloud', 'hill', and 'grass', words that describe the landscape of this type of place.

Text and Image Representation
Given a text-image post, we obtain a representation for each modality as follows. Text: For encoding the text, we use BERT (Devlin et al., 2019) to obtain the text representation $f_t$. Image: For encoding the images, we use Xception (Chollet, 2017) pre-trained on ImageNet (Deng et al., 2009). We extract convolutional feature maps for each image and apply average pooling to obtain the image representation $f_v$.
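For illustration, the image encoding step could be implemented with the pre-trained Xception model shipped with Keras; this is a sketch of one way to obtain $f_v$ (the framework and the `encode_image` helper are our assumptions, not specified in the paper):

```python
import numpy as np
import tensorflow as tf

# Xception pre-trained on ImageNet; include_top=False removes the classifier
# and pooling="avg" average-pools the final convolutional feature maps,
# yielding one 2048-dimensional vector per image (the representation f_v).
encoder = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, pooling="avg")

def encode_image(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.xception.preprocess_input(x)
    return encoder.predict(x)[0]  # f_v, shape: (2048,)
```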

MM-Gate
Given the complex semiotic relationship between text and image, we need a weighting strategy that assigns more importance to the most relevant modality while suppressing irrelevant information. Thus, a first approach is to use gated multimodal fusion (MM-Gate), similar to the approach proposed by Arevalo et al. (2020), to control the contribution of text and image to POI type prediction. Given the text and visual vectors $f_t$ and $f_v$, we obtain the multimodal representation $h$ of a post $P$ as follows:

$$h_t = \tanh(W_t f_t) \quad (1)$$
$$h_v = \tanh(W_v f_v) \quad (2)$$
$$z = \sigma(W_z [f_t; f_v]) \quad (3)$$
$$h = z * h_t + (1 - z) * h_v \quad (4)$$

where $[;]$ denotes concatenation and $\sigma$ is the sigmoid activation function. $h$ is a weighted combination of the textual and visual information $h_t$ and $h_v$ respectively. We fine-tune the entire model by adding a classification layer with a softmax activation function for POI type prediction.
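A minimal PyTorch sketch of the gated fusion in Equations 1-4 follows (class and dimension names are illustrative; this is a reconstruction, not the authors' code):

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    """Gated multimodal fusion (Eqs. 1-4): project each modality, then let a
    learned gate z decide how much text vs. image information to keep."""
    def __init__(self, d_text, d_img, d_hidden):
        super().__init__()
        self.W_t = nn.Linear(d_text, d_hidden)
        self.W_v = nn.Linear(d_img, d_hidden)
        self.W_z = nn.Linear(d_text + d_img, d_hidden)

    def forward(self, f_t, f_v):
        h_t = torch.tanh(self.W_t(f_t))                              # Eq. 1
        h_v = torch.tanh(self.W_v(f_v))                              # Eq. 2
        z = torch.sigmoid(self.W_z(torch.cat([f_t, f_v], dim=-1)))  # Eq. 3
        return z * h_t + (1 - z) * h_v                               # Eq. 4
```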

MM-XAtt
The MM-Gate model does not capture interactions between text and image that might be beneficial for learning semiotic relationships. To model cross-modal interactions, we adapt the cross-attention mechanism (Tsai et al., 2019) to combine text and image information for multimodal POI type prediction (MM-XAtt). Cross-attention consists of two attention layers, one from textual features $f_t$ to visual features $f_v$ and one from visual to textual features. We first linearly project the text and visual representations to the same dimensionality ($d_{proj}$). Then, we compute attention with the projected textual vector as the query ($Q$) and the projected image vector as the key ($K$) and value ($V$), and vice versa. The multimodal representation $h$ is the sum of the resulting attention layers. The entire model is fine-tuned by adding a classification layer with a softmax activation function.
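A sketch of the two cross-attention layers using PyTorch's built-in multi-head attention (single-head for simplicity; a reconstruction under our assumptions rather than the authors' implementation):

```python
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text attends to image and image attends to text; the multimodal
    representation h is the sum of the two attention outputs."""
    def __init__(self, d_text, d_img, d_proj):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_proj)
        self.img_proj = nn.Linear(d_img, d_proj)
        self.t2v = nn.MultiheadAttention(d_proj, num_heads=1, batch_first=True)
        self.v2t = nn.MultiheadAttention(d_proj, num_heads=1, batch_first=True)

    def forward(self, f_t, f_v):
        t = self.text_proj(f_t).unsqueeze(1)   # (batch, 1, d_proj)
        v = self.img_proj(f_v).unsqueeze(1)
        a_tv, _ = self.t2v(query=t, key=v, value=v)  # text queries image
        a_vt, _ = self.v2t(query=v, key=t, value=t)  # image queries text
        return (a_tv + a_vt).squeeze(1)              # h
```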

MM-Gated-XAtt
Vempala and Preoţiuc-Pietro (2019) have demonstrated that the relationship between the text and image in a social media post is complex: images may or may not add meaning to the post, and the text content (or meaning) may or may not correspond to the image. We hypothesize that this also happens in posts made from particular locations, i.e. language and visual information may or may not be related. To address this, we propose (1) using gated multimodal fusion to manage the flow of information from each modality, and (2) learning cross-modal interactions by applying cross-attention on top of the gated multimodal mechanism. Fig. 2 shows an overview of our model architecture (MM-Gated-XAtt). Given the text and image representations $f_t$ and $f_v$, we first compute $h_t$, $h_v$, and $z$ as in Equations 1, 2 and 3. Next, we apply cross-attention using two attention layers where the query and context vectors are the weighted representations of the text and visual modalities, $z * h_t$ and $(1 - z) * h_v$, and vice versa. The multimodal context vector $h$ is the sum of the resulting attention layers. Finally, we fine-tune the model by passing $h$ through a classification layer for POI type prediction with a softmax activation function.
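Putting the two components together, the full MM-Gated-XAtt forward pass could look as follows (again a hedged reconstruction; exact dimensionalities and head counts are not specified here):

```python
import torch
import torch.nn as nn

class MMGatedXAtt(nn.Module):
    """Gate first (Eqs. 1-3), then cross-attend between the gate-weighted
    modality vectors; h feeds a softmax classification layer."""
    def __init__(self, d_text, d_img, d_proj, n_classes):
        super().__init__()
        self.W_t = nn.Linear(d_text, d_proj)
        self.W_v = nn.Linear(d_img, d_proj)
        self.W_z = nn.Linear(d_text + d_img, d_proj)
        self.t2v = nn.MultiheadAttention(d_proj, num_heads=1, batch_first=True)
        self.v2t = nn.MultiheadAttention(d_proj, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(d_proj, n_classes)

    def forward(self, f_t, f_v):
        h_t = torch.tanh(self.W_t(f_t))
        h_v = torch.tanh(self.W_v(f_v))
        z = torch.sigmoid(self.W_z(torch.cat([f_t, f_v], dim=-1)))
        t = (z * h_t).unsqueeze(1)         # gate-weighted text
        v = ((1 - z) * h_v).unsqueeze(1)   # gate-weighted image
        a_tv, _ = self.t2v(query=t, key=v, value=v)
        a_vt, _ = self.v2t(query=v, key=t, value=t)
        h = (a_tv + a_vt).squeeze(1)
        return self.classifier(h)  # logits; softmax is applied in the loss
```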
Text and Image: For combining text and image information, we experiment with different standard fusion strategies: (1) we project the image representation $f_v$ to the same dimensionality as $f_t \in \mathbb{R}^{d_t}$ using a linear layer and then concatenate the vectors (Concat); (2) we project the textual and visual features to the same space and then apply self-attention to learn weights for each modality (Attention); (3) we adapt the guided attention introduced by Anderson et al. (2018) for learning attention weights at the object level (and other salient regions) rather than equally sized grid regions (Guided Attention); (4) we compare against LXMERT (Tan and Bansal, 2019), a transformer-based model pre-trained on text-image pairs for learning cross-modality interactions. All models are fine-tuned by adding a classification layer with a softmax activation function for POI type prediction. Finally, we evaluate a simple ensemble strategy that uses LXMERT for classifying tweets that are originally accompanied by an image and BERT for classifying text-only tweets (Ensemble).

Text Processing
We use the same tokenization settings as in Sánchez Villegas et al. (2020). For each tweet, we lowercase the text and replace URLs and @-mentions of users with placeholder tokens.
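A minimal sketch of this preprocessing (the exact placeholder token strings are our assumption):

```python
import re

def preprocess_tweet(text: str) -> str:
    """Lowercase and replace URLs and @-mentions with placeholder tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)   # URLs -> placeholder
    text = re.sub(r"@\w+", "<user>", text)          # @-mentions -> placeholder
    return text

# preprocess_tweet("Next stop: NYC @JFKairport https://t.co/abc")
# -> "next stop: nyc <user> <url>"
```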

Image Processing
Each image is resized to the default input size of the corresponding image encoder (e.g. 299 × 299 × 3 for Xception).

Implementation Details
We train all models with the Adam optimizer (Kingma and Ba, 2014) and select hyperparameters using early stopping, monitoring the validation loss. Because the data is imbalanced, we estimate class weights using the 'balanced' heuristic (King and Zeng, 2001). All experiments are performed on an Nvidia V100 GPU.
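The 'balanced' heuristic weights each class inversely proportionally to its frequency, e.g. with scikit-learn (the label array shown is hypothetical):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels; in practice these are the training-set POI categories.
train_labels = np.array(
    ["Food", "Food", "Food", "Nightlife Spot", "College & University"])

# 'balanced': weight_c = n_samples / (n_classes * count_c)
classes = np.unique(train_labels)
weights = compute_class_weight("balanced", classes=classes, y=train_labels)

# e.g. used to weight the cross-entropy loss during fine-tuning:
loss_fn = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float))
```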

Evaluation
We evaluate the performance of all models using macro F1, precision, and recall. Results are obtained over three runs with different random seeds; we report the average and the standard deviation.
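Macro-averaging gives each of the eight categories equal weight regardless of frequency; a sketch of this evaluation with scikit-learn (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Macro-averaged metrics: each class contributes equally."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_precision": precision_score(y_true, y_pred, average="macro"),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
    }

# Aggregate over three runs with different random seeds:
# scores = [evaluate(y_true, preds)["macro_f1"] for preds in run_predictions]
# print(np.mean(scores), np.std(scores))
```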

Results
The results of POI type prediction are presented in Table 3. We first examine the impact of each modality by analyzing the performance of the unimodal models, then we investigate the effect of multimodal methods for POI type prediction, and finally we examine the performance of our proposed model MM-Gated-XAtt by analyzing each component independently. We observe that the text-only model (BERT) achieves 43.67 F1, which is substantially higher than the performance of the image-only models (e.g. the best performing image-only model, EfficientNet, obtains 24.72 F1). This suggests that text encapsulates more relevant information for this task than images on their own, similar to other studies in multimodal computational social science (Wang et al., 2020; Ma et al., 2021).
Models that simply concatenate text and image vectors perform close to BERT (44.0 F1 for Concat-BERT+Xception) or lower (41.56 F1 for Concat-BERT+EfficientNet). This suggests that assigning equal importance to textual and visual information can deteriorate performance. It also shows that modeling cross-modal interactions is necessary to boost the performance of POI type classification models.
Surprisingly, we observe that the pre-trained multimodal LXMERT fails to improve over BERT (40.17 F1), performing below even the simpler concatenative fusion models. We speculate that this is because LXMERT is pre-trained on data where the text and image modalities share common semantic relationships, as is the case in standard vision-language tasks including image captioning and visual question answering (Zhou et al., 2020b; Lu et al., 2019). On the other hand, text-image relationships in social media data for inferring the type of location from which a message was sent are more diverse, highlighting the particular challenges of modeling text and images together (Hessel and Lee, 2020).
Our proposed MM-Gated-XAtt model achieves 47.21 F1, which significantly (t-test, p < 0.05) improves over BERT, the best performing model in Sánchez Villegas et al. (2020), and consistently outperforms all other image-only and multimodal approaches. This confirms our main hypothesis that jointly modeling text and image to learn the interactions between modalities benefits performance in POI type prediction. We also observe that using only the gating mechanism (MM-Gate) outperforms (44.64 F1) all other models except MM-Gated-XAtt. This highlights the importance of controlling the information flow between the two modalities. Using cross-attention on its own (MM-XAtt), on the other hand, fails to improve over other multimodal approaches, implying that learning cross-modal interactions is not sufficient by itself. This supports our hypothesis that language and visual information in posts sent from specific locations may or may not be related, and that managing the flow of information from each modality improves the classifier's performance.
Finally, we investigate using less noisy text-image pairs, in line with related computational social science studies involving text and images (Moon et al., 2018b; Cai et al., 2019; Chinnappa et al., 2019). We train and test LXMERT, MM-Gate, MM-XAtt, and MM-Gated-XAtt on tweets that are originally accompanied by an image (see Section 3), excluding all text-only tweets. The results are shown in Table 4. In general, performance is higher for all models using the less noisy data. Our proposed model MM-Gated-XAtt consistently achieves the best performance (57.64 F1). In addition, we observe that LXMERT and MM-XAtt produce similar results (47.72 and 48.93 F1 respectively), suggesting that cross-attention can be applied directly to text-image pairs in low-noise settings without hurting model performance. Controlling the flow of information through a gating mechanism, on the other hand, strongly improves model robustness.

Training on Text-Image Pairs Only
To compare the effect of the 'average' image (see Section 3) on the performance of the models, we train MM-Gate, MM-XAtt, and MM-Gated-XAtt on tweets that are originally accompanied by an image, excluding all text-only tweets, and we test on all tweets as in our original setting (text-only tweets are paired with the 'average' image). The results are shown in Table 5. MM-Gated-XAtt is consistently the best performing model, followed by MM-Gate. However, their performance is lower than when the models are trained on all tweets using the 'average' image, as in the original setting. This suggests that the gating operation not only regulates the flow of information for each modality but also learns how to use the noisy modality to improve model robustness.

Modality Contribution
To determine the influence of each modality on MM-Gated-XAtt when assigning a particular label to a tweet, we compute the average percentage of activations for the textual and visual modalities for each POI category on the test set. The outcome of this analysis is depicted in Fig. 3. As anticipated, the textual modality has a greater influence on the model prediction, which is consistent with our findings in Section 6. The categories where the visual modality has the greatest impact on the predicted label are Professional & Other Places (43.20%) followed by Shop & Service (43.11%).
To examine how visual information impacts the POI type prediction task, Fig. 4 shows examples of posts where the contribution of the image is large while the text-only model (BERT) misclassified the POI category. We observe that the text content of Post (a) misled BERT towards Food, while our multimodal model predicted the correct category; for this post, 40% of the contribution corresponds to the image and 60% to the text. This shows how image information can help to resolve the ambiguity of short texts (Moon et al., 2018a), improving POI type prediction.

Fig. 4 also shows examples of the cross-attention (XAtt) visualization. We note that the model focuses on relevant nouns and pronouns (e.g. 'track', 'it'), which are common informative words in vision-and-language tasks. Moreover, our model focuses on relevant words such as 'track' when classifying Post (a) as Great Outdoors. Lastly, we observe that the cross-attention often captures general image information, with emphasis on sections specific to the predicted POI category, such as the pine trees for Great Outdoors and the display racks for Shop & Service.

Error Analysis
To shed light on the limitations of our multimodal MM-Gated-XAtt model for predicting POI types, we performed an analysis of misclassifications. In general, we observe that the model struggles to distinguish POI categories in which people might perform similar activities, such as Food, Nightlife Spot, and Shop & Service, similar to findings by Ye et al. (2011). In these cases, the text and image content is often related to the Food category, misleading the classifier towards this POI type. Posting about food is a common practice in hospitality establishments such as restaurants and bars (Zhu et al., 2019), where customers are more likely to share content such as photos of dishes and beverages, intentionally designed to show that they are associated with the particular context and lifestyle that a specific place represents (Homburg et al., 2015; Brunner et al., 2016; Apaolaza et al., 2021). Similarly, Post (b) shows an example of a tweet that promotes a POI by communicating specific characteristics of the place (Kruk et al., 2019; Aydin, 2020). To correctly classify the category of such POIs, the model might need access to deeper contextual information about the locations (e.g. finer subcategories of a type of place and how POI types are related to one another).

Conclusion and Future Work
This paper presents the first study on multimodal POI type classification using text and images from social media posts, motivated by studies in geosemiotics, visual semiotics, and cultural geography. We enrich a publicly available data set with images and propose a multimodal model that uses: (1) a gating mechanism to control the information flow from each modality; (2) a cross-attention mechanism to align and capture the interactions between modalities. Our model achieves state-of-the-art performance for POI type prediction, significantly outperforming the previous text-only model and competitive pre-trained multimodal models.
In future work, we plan to perform more granular prediction of POI types and to use additional user information to provide context to the models. Our models could also be applied to other tasks where text and images naturally occur together in social media.

Ethical Statement
Our work complies with the Twitter data policy for research, and has received approval from the Ethics Committee of our institution (Ref.