Quantifying the Effects of COVID-19 on Restaurant Reviews

The COVID-19 pandemic has implications beyond physical health, affecting society and economies. Government efforts to slow down the spread of the virus have had a severe impact on many businesses, including restaurants. Mandatory policies such as restaurant closures, bans on social gatherings, and social distancing restrictions have affected restaurant operations as well as customer preferences (e.g., prompting a demand for stricter hygiene standards). As of now, however, it is not clear how and to what extent the pandemic has affected restaurant reviews, an analysis of which could potentially inform policies for addressing this ongoing situation. In this work, we present our efforts to understand the effects of COVID-19 on restaurant reviews, with a focus on Yelp reviews produced during the pandemic for New York City and Los Angeles County restaurants. Overall, we make the following contributions. First, we assemble a dataset of 600 reviews with manual annotations of fine-grained COVID-19 aspects related to restaurants (e.g., hygiene practices, service changes, sympathy and support for local businesses). Second, we address COVID-19 aspect detection using supervised classifiers, weakly-supervised approaches based on keywords, and unsupervised topic modeling approaches, and experimentally show that classifiers based on pre-trained BERT representations achieve the best performance (F1=0.79). Third, we analyze the number and evolution of COVID-related aspects over time and show that the resulting time series have substantial correlation (Spearman's ρ=0.84) with critical statistics related to the COVID-19 pandemic, including the number of new COVID-19 cases. To our knowledge, this is the first work analyzing the effects of COVID-19 on Yelp restaurant reviews and could potentially inform policies by public health departments, for example, to cover resource utilization.

Figure 1: Example Yelp reviews discussing COVID-19-related aspects of restaurants: "… Just know there's no restroom or sink for patrons to wash their hands. They do have hand sanitizers and wipes, but personally I prefer washing my hands …" and "I usually go there for my pizza but I had to walk out because I saw the employees handling the food with no gloves on. In light of the recent outbreak of the Coronavirus how are they still not wearing gloves?" (July 30th, 2020).

Introduction
The outbreak of the SARS-CoV-2 virus in December of 2019 and its evolution to the COVID-19 pandemic have had many devastating consequences in society. Restaurants have been among the hardest-hit businesses during the pandemic. Yelp data (as of September 2020) shows that out of the 32,109 restaurant closures in the U.S., 61% have been permanent, and a greater impact is observed in local businesses in larger metropolitan areas, such as New York City and Los Angeles County, on which we focus in this paper.
Restaurants operate under great uncertainty during this ongoing situation and, therefore, it is critical to understand how the pandemic has affected public attitude towards restaurants. The disruption in daily routines as well as fear and anxiety due to the pandemic have been shown to affect eating habits (Naja and Hamadeh, 2020;Di Renzo et al., 2020). The pandemic may have also affected customers' preferences, such as changes in cuisine types, or higher expectations of hygiene and social distancing practices followed by restaurants.
In this paper, we present our efforts to understand the effects of COVID-19 on restaurant reviews. Reviewers provide ratings and free-form text to express their opinions and experiences about restaurants and we argue that the pandemic has affected such reviews. As an example, Figure 1 shows a Yelp review discussing the hygiene practices of a restaurant, including a mention of "coronavirus" and associated concerns. To understand more broadly the effect of the pandemic on restaurant reviews, we analyze 3 million Yelp reviews published before and during the pandemic, for restaurants in two large metropolitan areas, namely, New York City and Los Angeles County. We measure changes in user activity, ratings, and restaurant type preferences using the corresponding metadata, and quantify changes in written text using relevant extraction and classification techniques.
Overall, we make the following contributions.
Creation of a dataset with fine-grained COVID-19 aspect annotations. To facilitate text analysis, we create a dataset of 600 Yelp restaurant reviews with manual annotations of fine-grained COVID-19 aspects discussed in the reviews, such as hygiene practices, concerns of virus transmission, and sympathy and support messages. Our annotations can support detailed review analyses beyond simple mentions of COVID in text.

Evaluation of COVID-19 aspect extraction techniques. We use our dataset to evaluate several techniques for COVID-19 aspect extraction from the review text, including unsupervised topic modeling (Blei et al., 2003), weakly-supervised classification based on COVID-related keywords (Karamanolakis et al., 2019), and (fully) supervised classification.
Analysis of the correlation between Yelp reviews and critical COVID-19 statistics over time. We analyze the distribution and evolution of the extracted COVID-19 aspects and other review metadata over time, capturing the period before and during the pandemic. We observe revealing trends, such as increased interest in fast food restaurants compared to traditional American-food restaurants (including brunch restaurants), increased mentions of hygienic practices of restaurants (Figure 1), service changes, racist and xenophobic attacks against the Asian American community, and sympathy and support messages expressed especially for local businesses. Crucially, we show that the resulting time series have substantial correlation (Spearman's ρ=0.84, p<0.01) with critical statistics and milestones related to the pandemic, such as the number of COVID-19 cases in the U.S. While our findings do not necessarily imply that the observed trends are caused by the pandemic, they may provide useful insights for restaurant owners, customers, public health officials, and the broad research community.

This paper is organized as follows. Section 2 reviews related work. Section 3 describes the Yelp data collection and annotation procedures. Section 4 outlines the techniques for data analysis. Section 5 summarizes our findings. Finally, Section 6 concludes the paper and suggests future work.

Related Work
The natural language processing community has been increasingly pushing efforts towards the better understanding and management of the pandemic. Valuable insight can be extracted from text data, including the COVID-19 scientific literature (Wang et al., 2020;Gutierrez et al., 2020), and web search data (Effenberger et al., 2020;Rovetta and Bhagavathula, 2020). Below, we review related work on the analysis of online user-generated reviews and posts on social media.
Social media reflects public attitudes during the pandemic (Chen et al., 2020). Existing work on sentiment or emotion analysis has considered Twitter (Drias and Drias, 2020; Nemes and Kiss, 2020; Li et al., 2020a; Samuel et al., 2020), Reddit (Biester et al., 2020), Weibo (Li et al., 2020b), and other platforms (Kleinberg et al., 2020). For example, Biester et al. (2020) analyzed how the pandemic has influenced the online behavior of Reddit users and found an increase in posts expressing mental health concerns, including anxiety and concerns for health and family. Beyond sentiment analysis, existing work has considered deep learning techniques for the identification of informative tweets that contain information relevant to the pandemic (Nguyen et al., 2020; Laxmi et al., 2020; Verspoor et al., 2020). To our knowledge, our work is the first that analyzes the effects of COVID-19 on restaurant reviews. To perform this analysis, we extract fine-grained COVID-19 aspects related to restaurants (e.g., hygiene practices, sympathy and support, social distancing, etc.).
Other work has studied nutrition during the pandemic by conducting surveys (Di Renzo et al., 2020) or by analyzing Twitter (Van et al., 2020). Van et al. (2020) observe a shift from mentions of healthy to unhealthy foods. Naja and Hamadeh (2020) propose a framework for action to maintain optimal nutrition during the pandemic. As part of our work, we show trends in restaurant preferences, such as increased interest in fast food restaurants compared to traditional American-food restaurants.
While prior work demonstrates changes in public attitude and nutrition during the pandemic, it is not clear how and to what extent restaurant reviews have changed during the pandemic. Yelp has introduced special COVID-19 review guidelines, and subsequently removed more than 4,000 reviews that violated those guidelines. Our work demonstrates that many aspects of the review content and metadata have changed during the pandemic.

Yelp Data Collection
We consider Yelp reviews for New York City (NYC) and Los Angeles County (LA) restaurants posted between January 1, 2019 and December 31, 2020. Our dataset overall consists of 1 million reviews for NYC and 2.1 million reviews for LA. Figure 2 plots the number of reviews across time as well as the number of new COVID-19 cases in the U.S. For both NYC and LA, the number of reviews decreases significantly after January 2020, especially in March and April 2020: shutdowns and more stringent guidelines were put into effect starting in March. Such restrictions were only lifted in July 2020, and a second peak in the number of reviews is observed during September 2020.

COVID-19 Aspect Annotation
We manually labeled 600 reviews published after March 1, 2020 with annotations relevant to COVID-19. In particular, we aimed to understand what aspects of restaurant operations are discussed in reviews referring to the pandemic. We will use these labels in Section 4.1 to train and evaluate classifiers for COVID aspect detection. The 600 reviews were selected as follows. First, we considered all reviews after March 1, 2020 that contain COVID-related keywords and selected 400 reviews uniformly at random among them. Second, we selected 200 reviews uniformly at random from all reviews after March 1, 2020 that do not contain such keywords. We considered the following aspects related to COVID-19:

1. Hygiene: hygiene conditions of restaurants and protective equipment (e.g., "Just know there's no restroom or sink for patrons to wash their hands. They do have hand sanitizers and wipes, but personally I prefer washing my hands.").
2. Transmission: concern of virus transmission (e.g., "All the while coughing without covering his mouth").
3. Social Distancing: social distancing measures (e.g., "The tables are set far apart -a more than acceptable social distance").
4. Racism: racism experiences (e.g., "She was the only one waiting at the register but no one came to ring her up. She waited for a while but decided to leave after realizing she was ignored because of her race.").
5. Sympathy and Support: messages of solidarity, for example, towards local businesses (e.g., "Help support your Chinatown restaurants who are deeply hurting from the stigma around corona virus.").
6. Service: service changes during the pandemic (e.g., "Not sure if the restaurant was empty because of the coronavirus scare but the food came out suuuuper fast...").
7. Other: aspects that are related to COVID but that do not fall under any of the above categories (e.g., "Shame on management for taking advantage of people trying to keep safe from coronavirus during a NY state of emergency.").
We annotated each review with a COVID-related aspect if at least one sentence of the review discusses that aspect. A single review can be annotated with more than one distinct aspect. Reviews that did not contain any sentence deemed relevant to any of the seven COVID-related aspects received the "Non-COVID" label. Table 1 shows annotation statistics. (We discuss review ratings later.) Most reviews discuss hygiene conditions of restaurants, and many reviews discuss social distancing measures as well as changes in the restaurant service related to COVID.
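The review-selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the keyword list and the `reviews` collection are hypothetical stand-ins (the actual COVID-related keyword list is not reproduced here).

```python
import random

# Hypothetical COVID-related keywords (a stand-in for the paper's actual list).
COVID_KEYWORDS = ["covid", "coronavirus", "pandemic", "quarantine", "mask"]

def contains_covid_keyword(text):
    lowered = text.lower()
    return any(kw in lowered for kw in COVID_KEYWORDS)

def sample_for_annotation(reviews, n_keyword=400, n_other=200, seed=0):
    """Select reviews for manual annotation: n_keyword reviews that contain
    a COVID-related keyword and n_other reviews that do not."""
    rng = random.Random(seed)
    with_kw = [r for r in reviews if contains_covid_keyword(r)]
    without_kw = [r for r in reviews if not contains_covid_keyword(r)]
    sample = rng.sample(with_kw, min(n_keyword, len(with_kw)))
    sample += rng.sample(without_kw, min(n_other, len(without_kw)))
    return sample

reviews = ["Great pizza, staff all wore masks.",
           "Loved the pasta here.",
           "Closed during covid, sadly.",
           "Best brunch spot in town."]
selected = sample_for_annotation(reviews, n_keyword=2, n_other=1)
print(len(selected))  # 3
```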

Methodology
We now describe the techniques that we apply to the 3.1 million Yelp reviews from Section 3 for COVID-19 aspect analysis (Section 4.1) and time series analysis (Section 4.2), leveraging the labeled reviews of Section 3.2.

COVID-19 Aspect Analysis
First, we extract topics from reviews using unsupervised topic modeling. We train Latent Dirichlet Allocation (LDA) topic models (Blei et al., 2003) with different numbers of topics (5, 10, 25, 50, 100). Then, we manually annotate the obtained topics with descriptive labels by examining the highest-probability words for each topic. We found it hard to align the topics discovered by LDA with the COVID aspects of interest (Section 3.2) and, therefore, we also experiment with supervised and weakly-supervised techniques, as discussed next.
We use our annotated dataset from Section 3.2 to train and evaluate review classifiers (via 5-fold cross-validation) for multi-class COVID-19 aspect classification. We consider two alternative training procedures: fully-supervised classification using labeled training data, and weakly-supervised classification using a small number of indicative keywords per class. The fully-supervised approaches are standard and listed at the end of this subsection.
Overall, we consider the following approaches: 1. Random: assigns reviews to a random aspect.
3. Supervised bag-of-words (BoW) classifiers: represents each review as a bag of words, where words can be unigrams and bigrams. We evaluate logistic regression (LogReg) and Support Vector Machines (SVM).
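A minimal sketch of the supervised BoW setup with 5-fold cross-validation follows. The 20 synthetic reviews stand in for the 600 annotated reviews, and TF-IDF weighting is an assumption (the paper only specifies unigram and bigram bag-of-words features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Tiny synthetic stand-in for the annotated dataset (binary: COVID vs. not).
texts = (["masks and hand sanitizer everywhere"] * 10 +
         ["the pasta was delicious"] * 10)
labels = [1] * 10 + [0] * 10

for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    # Unigram + bigram bag-of-words representation, as in the paper.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro")
    print(type(clf).__name__, scores.mean())
```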
The above techniques classify Yelp reviews into COVID aspects using either labeled data (supervised approach) or COVID-related keywords (weakly-supervised approach) for training. In addition to COVID aspect classification, we conduct time series analysis to understand how COVID aspects evolve over time, as discussed next.

Time Series Analysis
To understand how reviews have changed during the pandemic, we extract time series from the text of the reviews. For a given aspect (e.g., Hygiene), the corresponding time series is computed as the percentage of the reviews at each point in time that contain at least one aspect-specific keyword (see Section 4.1). We consider two approaches: time-series cross-correlation and time-series intervention analysis, as discussed next.

As a first approach, we measure the correlation between the Yelp review time series and important statistics related to COVID-19, such as the number of new COVID-19 cases in the U.S. or the new COVID-19 cases in NYC and LA individually. As we do not expect Yelp review time series to have a linear relationship with COVID-19 time series, we compute Spearman's correlation, which only assumes a monotonic but possibly non-linear relationship between the two time series. We also measure Pearson's correlation as a robustness check.
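The time-series construction and correlation computation can be sketched as follows; the review timestamps, keyword flags, and case counts below are hypothetical stand-ins for the actual data:

```python
import pandas as pd
from scipy.stats import spearmanr, pearsonr

# Hypothetical review data: timestamps and whether each review contains
# at least one Hygiene-specific keyword.
reviews = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-10",
                            "2020-03-11", "2020-03-17", "2020-03-18"]),
    "mentions_hygiene": [0, 1, 1, 1, 1, 1],
})

# Percentage of reviews per week mentioning the aspect.
weekly = (reviews.set_index("date")
                 .resample("W")["mentions_hygiene"]
                 .mean() * 100)

new_cases = [100, 1000, 5000]  # hypothetical weekly new-case counts

rho, p = spearmanr(weekly.values, new_cases)  # monotonic association
r, _ = pearsonr(weekly.values, new_cases)     # linear robustness check
print(rho, p, r)
```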
As a second approach, we consider a time-series intervention analysis. First, we train a time-series model on the observations before COVID-19 (i.e., on reviews posted before March 1, 2020) and then we compare the model's predictions against the observations during COVID-19 (i.e., on reviews posted on March 1, 2020 or later). Similar to Biester et al. (2020), we consider the Prophet time-series forecasting model (Taylor and Letham, 2018), an additive regression model that has been shown to forecast social media time series effectively. After training Prophet on the pre-pandemic data, we check to what degree its forecasts for the COVID-19 period differ from the actual values. Specifically, we compute the proportion of observations outside the 95% prediction uncertainty interval produced by Prophet after March 1, 2020.
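The final outlier-proportion computation can be sketched with plain NumPy. In the actual pipeline the 95% interval bounds would come from Prophet's forecast (its yhat_lower/yhat_upper columns); here they are passed in directly with made-up values:

```python
import numpy as np

def outlier_proportion(actual, lower, upper):
    """Fraction of observations outside a model's 95% prediction interval.
    In the paper this interval comes from Prophet's forecast on the
    post-March-2020 period; here the bounds are supplied directly."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    outside = (actual < lower) | (actual > upper)
    return outside.mean()

# Hypothetical during-COVID observations vs. a forecast interval.
actual = [5.0, 9.0, 12.0, 3.0]
lower = [4.0, 4.5, 5.0, 5.5]
upper = [8.0, 8.5, 9.0, 9.5]
print(outlier_proportion(actual, lower, upper))  # 0.75
```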
By constructing Yelp review time series and comparing them to statistics related to COVID-19, we find interesting trends in reviews during the pandemic, as discussed next.

Findings
We use the methodology from Section 4 to address various questions on the 3.1 million-review dataset from Section 3. First, we analyze the text of the Yelp reviews (Section 5.1), and then we use both the metadata and the text to create time series and evaluate their correlation with the number of COVID-19 cases (Section 5.2).

Table 2: The two LDA topics (out of 25) that we identified as relevant to COVID-19, with manually assigned labels and the 10 highest-probability words per topic.

Topic label (manually assigned) | 10 highest-probability words
Protective equipment and social distancing | covid, mask, masks, people, customers, staff, social, wearing, distancing, pandemic
Outdoor seating | outdoor, seating, dining, ramen, good, tables, covid, outside, really, place

COVID-19 Aspect Analysis
In this section, we analyze the text of the Yelp reviews and evaluate the performance of several methods for COVID aspect classification on our manually annotated dataset from Section 3.2.
Number of reviews with COVID-related keywords: Figure 3 shows the percentage of reviews that contain COVID-related keywords. Interestingly, after March 2020, more than 10% of the reviews contain COVID-related keywords: thousands of restaurant reviews per week discuss aspects related to the pandemic.
Topics discussed in reviews: We apply topic modeling on all Yelp reviews after March 1, 2020. Table 2 shows the two (out of the 25) topics that we identified as relevant to COVID-19. The first topic is related to protective equipment and social distancing, while the second topic is related to outdoor seating. The remaining 23 topics did not contain any COVID-related keywords among the 10 highest-probability words: it is hard to align the topics discovered by LDA with the fine-grained COVID aspects of Section 3.2, so we consider aspect classification approaches, as discussed next.
COVID aspect classification: We evaluate supervised and weakly-supervised approaches for COVID aspect classification via cross-validation using the 600 manually annotated reviews (Section 4.1). Table 3 shows the cross-validation results for binary (COVID vs. Not COVID) and multi-class aspect classification. Table 8 in the Appendix reports additional metrics. The fully supervised BERT-based classifier outperforms BoW-* classifiers on both binary and multi-class classification.
The weakly-supervised Teacher that classifies aspects using keywords and no labeled data (Section 4.1) leads to a more accurate Student-BERT classifier: weakly-supervised co-training with keywords leads to substantially better performance than Random. The weakly-supervised Student-BERT has lower F1 score than the fully supervised BERT, which was expected because Student-BERT does not consider labeled reviews for training but instead uses Teacher's predictions on unlabeled reviews as weak supervision.
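The Teacher-Student setup can be sketched as follows. This is a simplified illustration: the keyword lists are hypothetical, and a logistic regression classifier stands in for the BERT-based Student.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical keyword lists per aspect (the Teacher's only supervision).
ASPECT_KEYWORDS = {
    "Hygiene": ["sanitizer", "gloves", "wash"],
    "Social Distancing": ["distancing", "apart", "spaced"],
}

def teacher_label(text):
    """Keyword-based Teacher: assign the first aspect whose keyword matches,
    or None if no keyword appears (review left unlabeled)."""
    lowered = text.lower()
    for aspect, kws in ASPECT_KEYWORDS.items():
        if any(kw in lowered for kw in kws):
            return aspect
    return None

unlabeled = [
    "hand sanitizer at every table and staff wore gloves",
    "please wash your hands before entering",
    "tables are spaced far apart for distancing",
    "the tacos were amazing",
]

# Train the Student only on reviews the Teacher could label.
pairs = [(t, teacher_label(t)) for t in unlabeled]
train = [(t, y) for t, y in pairs if y is not None]
texts, labels = zip(*train)

student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
student.fit(texts, labels)

print(student.predict(["no one wore gloves while serving"]))
```

The Student generalizes beyond the Teacher because it learns from the full review text, not only the keywords themselves.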

Time-series Construction
In this section, we analyze how reviews have changed during the pandemic by extracting time series from metadata (star ratings, cuisine types) and the text of the reviews (see Section 4.2).
Star ratings: Figure 4 shows the average star rating over time for NYC and LA. For both time series, there is a sharp decrease in average rating starting in March 2020 and an increasing trend after June 2020. Figure 5 shows the number of star ratings across time for NYC. The trends are similar for LA (Figure 9b in the Appendix). For the first time after 2019, the percentage of 1-star ratings in NYC surpassed the percentage of 4-star ratings. Interestingly, for both NYC and LA, a peak in the number of new COVID-19 cases (April 2020 for NYC and July 2020 for LA) coincides with a peak in the percentage of 1-star ratings. Also, after September 2020, there is a decreasing trend in the number of 1-star ratings and an increasing trend in the number of 5-star ratings. We conclude that, during the first months of the pandemic, users' ratings shifted to extremely positive (5-star) or extremely negative (1-star) values, but after September 2020, users posted increasingly more 5-star reviews, leading to an overall increase in average rating.
Types of cuisine: Restaurant metadata include tags that indicate the cuisine types, such as "Italian" and "sandwich." Figure 6 shows the percentage of reviews for selected groups of cuisine types over time. Such time series are relatively stable during 2019 but change significantly during 2020. "American" substantially dropped at the beginning of the pandemic (March) and rose again after indoor dining re-opened (July). The drop in "American" coincided with the increase of "Fast Food." "Asian Food" also dropped sharply in March but recovered quickly within 2 weeks. These trends indicate important changes in user activity during the pandemic that affect specific cuisine types, consistent with previous observations of nutrition changes (Van et al., 2020).

Evolution of restaurant review aspects over time: Figure 7 shows the evolution of aspects over time for NYC. Aspects for LA reviews follow similar trends (see Table 12 in the Appendix). Aspects such as "Hygiene" and "Social Distancing" have been discussed more frequently after March 2020, covering up to 8% of the restaurant reviews: reviewers discuss such aspects during the pandemic more than before the pandemic. Interestingly, while "Hygiene" peaked during July 2020 (during restaurant re-opening) for both cities and has kept decreasing since then, "Sympathy & Support" peaked during Spring 2020, then decreased, and follows an increasing trend after November 2020.
Correlation of aspects with COVID-19 statistics: We now consider our first approach for time series analysis from Section 4.2 and measure the correlation between Yelp review time series and COVID-19 statistics.

Figure 6: Evolution of cuisine types over time for LA, annotated with key events (WHO declared PHEIC; restaurant closures and stage 1 disaster declaration; stage 2; stage 3; indoor dining closed again). For each time series, we compute the percentage of reviews that include at least one tag from a predefined tag list: "American": ["steak", "cocktailbars", "bars", "breakfast brunch", "newamerican", "tradamerican"]; "Fast Food": ["sandwiches", "pizza", "hotdogs", "chicken wings", "thai"]; "Groceries": ["grocery"]; "Desserts&Drinks": ["juicebars", "bubbletea", "icecream", "desserts", "bakeries"]; "Asian&Seafood": ["sushi", "japanese", "seafood", "asianfusion", "korean"]. Tags within each category follow similar trends, which we individually report in the Appendix.

For LA, most aspects present higher correlation with the number of US cases compared to the number of LA cases. For NYC, most aspects present higher correlation with the number of NYC cases compared to the number of US cases. Even though we cannot draw causal conclusions from these correlations, our results highlight interesting trends of Yelp reviews during the pandemic.
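The tag-group time series underlying Figure 6 can be sketched as follows; the review data is hypothetical and the tag lists are abbreviated:

```python
import pandas as pd

# Abbreviated tag groups (see the Figure 6 caption for the full lists).
TAG_GROUPS = {
    "American": {"steak", "bars", "newamerican", "tradamerican"},
    "Fast Food": {"sandwiches", "pizza", "hotdogs"},
}

# Hypothetical reviews: posting month and the restaurant's business tags.
reviews = pd.DataFrame({
    "month": ["2020-02", "2020-02", "2020-03", "2020-03"],
    "tags": [{"steak", "bars"}, {"pizza"}, {"pizza"}, {"hotdogs"}],
})

for group, group_tags in TAG_GROUPS.items():
    # A review counts toward a group if it has at least one tag in the group.
    in_group = reviews["tags"].apply(lambda t: bool(t & group_tags))
    pct = in_group.groupby(reviews["month"]).mean() * 100
    print(group)
    print(pct)
```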
Time-series intervention analysis: Here, we consider our second approach for time series analysis (Section 4.2) and compare time series constructed from the metadata of Yelp reviews to the corresponding Prophet forecasts. Figure 8 shows the evolution of the "pizza" tag (left) and "seafood" tag (right) over time and the Prophet forecasts. During COVID-19 (i.e., on March 1, 2020 or later), most true values for "pizza" were higher than Prophet forecasts, while most true values for "seafood" were lower. The Appendix reports forecasts for more cuisine tags. The difference between Prophet's forecasts and true values indicates that user activity has shifted towards specific types of businesses, as we further discuss next.

Table 5 reports the percentage of outliers (i.e., true values outside of Prophet's 95% uncertainty interval) for star ratings (top) and some of the most frequent business tags (bottom). For tags such as "Grocery," "Chicken Wings," and "Sandwiches," upward-pointing arrows indicate that the mean value of outliers is higher than the mean of Prophet's predictions. In contrast, for tags such as "New American," "Asian Fusion," and "Japanese," downward-pointing arrows indicate that the mean value of outliers is lower than the mean of Prophet's predictions. The Appendix reports all forecasts of Prophet. The direction arrows in Table 5 support our previous observations about the corresponding changes of cuisine types and star ratings during the pandemic.

Conclusions and Future Work
We presented our effort to understand the effects of COVID-19 on restaurant reviews. We created a dataset with fine-grained COVID-19 aspect annotations, evaluated fully- and weakly-supervised techniques for COVID aspect detection, and showed that BERT-based classifiers outperform bag-of-words classifiers. We observed changes in restaurant reviews (e.g., increased discussions of hygiene practices and messages of solidarity), and showed that they correlate with critical COVID-19 statistics. We found a shift of ratings towards extreme values (1 and 5 stars) and shifts of user activity towards specific types of cuisines. Our insights could potentially be interesting for restaurant owners, customers, and public health officials.
In future work, we plan to expand the regional coverage of our analysis to reveal distinct patterns across cities. It would also be interesting to improve aspect-based sentiment analysis approaches (Pontiki et al., 2016).

A Appendix
Here, we provide detailed information on our dataset (Section A.1), topic modeling and aspect classification results (Section A.2), time series plots (Section A.3), correlation analysis results (Section A.4), and time-series intervention analysis results (Section A.5).

A.1 Yelp Review Dataset
Table 6 shows more statistics for our dataset. Our COVID aspect annotations for the 600 Yelp reviews are available at the following link: https://drive.google.com/drive/folders/1PwYGO68fDjpjRgKN6rry-P9ji570Ia-r.

A.2 Topic Modeling and Aspect Classification
Table 7 shows the 25 LDA topics obtained from all reviews posted after March 1, 2020. Table 8 reports detailed evaluation results for COVID aspect classification.

A.3 Time Series Plots
Star ratings: Figure 9 shows the number of star ratings across time for NYC and LA.
Cuisine types: Figure 10 shows the percentage of reviews in NYC (top) and LA (bottom) over time for each selected group of cuisine types. Figure 11 shows the percentage of reviews over time for each individual business tag in our selected groups of cuisine types.
COVID-19 aspects: Figure 12 shows the percentage of reviews over time for each COVID-19 aspect.

A.4 Correlation Analysis
Tables 9 and 10 report correlation results between time series constructed from restaurant reviews and the number of new COVID-19 cases for NYC and LA, respectively. Tables 11 and 12 show correlation results between each individual business tag and the number of new COVID-19 cases for NYC and LA, respectively.

Table 10: Pearson correlation between each time series and the number of new COVID-19 cases in LA and the U.S.