Classification and Geotemporal Analysis of Quality-of-Life Issues in Tenant Reviews

Online tenant reviews of multifamily residential properties present a unique source of information for commercial real estate investing and research. Real estate professionals frequently read tenant reviews to uncover property-related issues that are otherwise difficult to detect, a process that is both biased and time-consuming. Using this as motivation, we asked whether a text classification-based approach can automate the detection of four carefully defined, major quality-of-life issues: severe crime, noise nuisance, pest burden, and parking difficulties. We aggregate 5.5 million tenant reviews from five sources and use two-stage crowdsourced labeling on 0.1% of the data to produce high-quality labels for subsequent text classification. Following fine-tuning of pretrained language models on millions of reviews, we train a multi-label review classifier that achieves a mean AUROC of 0.965 on these labels. We next use the model to reveal temporal and spatial patterns among tens of thousands of multifamily properties. Collectively, these results highlight the feasibility of automated analysis of housing trends and investment opportunities using tenant-perspective data.


Introduction
The use of artificial intelligence in commercial real estate investing has grown given the availability of new data modalities. Motivated by the potential for new insights and improved investment decisions in the large real estate market, recent efforts have used cellular network data (Pinter et al., 2020), satellite images (Law et al., 2019), building permits (Lai and Kontokosta, 2019), interior and exterior photos for luxury estimation and automated appraisal (Poursaeed et al., 2018), and construction of new retail stores for predicting future rent growth (Humphries and Rascoff, 2015), among others. However, one mostly untapped, yet highly informative, data source is online tenant reviews.
Online tenant reviews of the properties in which tenants reside present a unique source of information in the multifamily domain due to their distinctive, tenant-perspective view (Fradkin et al., 2015). In recent years, the popularity of such reviews has grown such that there are now millions of newly generated reviews annually, with some properties garnering hundreds and even thousands of reviews over time. Nonetheless, as they are rarely constrained to a specific format and can drastically vary in length and linguistic style, classifying reviews for detection of quality-of-life issues is a challenging task.
Text classification refers to the process of categorizing textual data into a set of defined classes. Classical approaches to text classification rely on feature extraction techniques such as n-grams, Bag-of-Words, and TF-IDF, optionally followed by a dimensionality reduction step, and then a classification model such as Logistic Regression, Naive Bayes, Support Vector Machines, Latent Dirichlet Allocation, or Nearest-Neighbors algorithms (Kowsari et al., 2019; Kiatkawsin et al., 2020). More recently, deep-learning-based language models trained using contextualized word representations have achieved state-of-the-art results on a wide range of natural language benchmarks and datasets, including text classification (Devlin et al., 2019; Lewis et al., 2020; Minaee et al., 2021). Deep-learning language models generally require large training data, use up to billions of parameters, and are costly to train. Fortunately, language models pretrained on large corpora such as Wikipedia or Common Crawl can be adapted to perform tasks in diverse domains very effectively and with little labeled data (Sun et al., 2020).
The above process is referred to as fine-tuning or transfer learning and entails modifying the parameters of the pretrained model to adapt to the statistical properties of the new corpus. Fine-tuning has been shown to improve learned representations and consequently downstream predictions on numerous domain-specific corpora without requiring large-scale labeling (Elwany et al., 2019; Lee et al., 2020), thus opening the possibility of employing these techniques in different applications with relative ease, including tenant review classification.

Figure 1: Study workflow. 5.5M reviews were collected, of which a small subset were manually labeled via crowdsourcing using multiple labelers per review. A larger set of reviews was used for language model fine-tuning, and the full set was used for uncovering domain-specific insights.
Prior NLP-based efforts on online reviews have used both classical (Hu and Liu, 2004; de Kok et al., 2018) and deep-learning-based representations (Xu et al., 2019) to extract sentiment polarity and/or classify reviews (Pontiki et al., 2014a,b). One popular group of methods, known as aspect-based sentiment analysis (ABSA), attempts to combine these two tasks by evaluating sentiment polarity with respect to specific aspects (Poria et al., 2020). One notable example in the real estate domain performed a local analysis of 7,673 neighborhood-level reviews in New York City using ABSA and topic modeling.
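As a toy illustration of the classical end of this spectrum, the sketch below pairs TF-IDF features with logistic regression using scikit-learn; the example reviews and the single "pests" label are invented for illustration and do not come from the study's data or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy reviews and a single binary label (1 = mentions pests).
reviews = [
    "Roaches everywhere, management never sprays.",
    "Great staff and a quiet, clean building.",
    "Saw mice in the hallway twice this month.",
    "Love the pool and the friendly neighbors.",
]
labels = [1, 0, 1, 0]

# TF-IDF features (word unigrams and bigrams) feeding a logistic regression.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(reviews, labels)

# Probability that an unseen review mentions pests.
probs = clf.predict_proba(["The apartment has a roach problem"])[:, 1]
```

Such a pipeline is cheap to train but, unlike the contextual models discussed below, treats words largely independently of their context.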
A commonality across many review classification efforts is that the review classes are generally broadly defined. However, carefully-tuned class definitions are often of high value to practitioners. For such cases, an approach that goes beyond coarse-grained classification may be beneficial.
In this paper, we analyze a dataset of nearly 5.5 million tenant reviews from multiple online sources, covering tens of thousands of multifamily properties in the US. After analyzing the textual characteristics of this unique corpus, we describe an iterative crowdsourcing-based approach to ensure accurate labeling of a random sample of reviews for multiple, non-mutually exclusive classes. We then show how, using state-of-the-art NLP techniques, we label millions of reviews using a model that was trained on a few thousand annotated samples, and that the labeled corpus provides important insights on spatiotemporal trends affecting the real estate market (Fig 1).

Corpus
The data used in this study consisted of 5,468,037 online tenant reviews gathered from five different sources, covering approximately 96,134 different US multifamily properties¹ and spanning 21 years (2000-2020; Table 1). The total number of words in the corpus was 536,702,874, amounting to 14% of the size of Wikipedia as determined on April 1st, 2021. The contribution of the five sources to the total number of reviews varied from 2.3% to 52% of the corpus, with the largest two sources accounting for 91% of the reviews. 99.2% of the reviews in the corpus are written in English, as estimated using the langdetect Python library².
The data for each review consisted of the review body text and metadata containing the date and the specific property associated with the review. The distribution of reviews per property was highly skewed, as was the distribution of words per review (Fig 2a and 2b). The majority of the reviews (66%) were from recent years (2015-2020), consistent with the increasing popularity of online media and the digitization of commercial real estate (Fig 2c). Geographically, reviews showed nationwide coverage, with Texas having the largest number of reviews, both in absolute and relative (per-capita) terms (Fig 2d). The reviews varied significantly in their sentiment and linguistic style. While the majority of the reviews were positive ("The [property name] staff are great and the residents are nice. It is a quiet and safe place to live"), some expressed anger and frustration with the property, its surroundings, or its management ("This place Is horrible I would not alow my dogs to live their, drugs being sold and apartments getting robbed stay away from these people").
We randomly sampled 500 reviews and 500 Wikipedia articles of similar lengths to measure the statistical discrepancy between the reviews corpus and a more general corpus such as Wikipedia. We then obtained 1,000 document embeddings using fastText (Joulin et al., 2017) and computed the pairwise Euclidean distance matrix between embeddings (Fig 2e). The block-diagonal structure of the resulting dissimilarity matrix implied that the model representations of reviews were clustered relative to random articles, reflecting their statistical and linguistic idiosyncrasies. This suggested the importance of fine-tuning a pretrained language model on the reviews corpus (see Section 4).
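A minimal sketch of this dissimilarity computation, with random Gaussian vectors standing in for the fastText document embeddings; the two cluster centers are an assumption that mimics the observed block-diagonal structure.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Stand-ins for 300-d fastText document embeddings: reviews drawn around one
# center, Wikipedia articles around another, mimicking two linguistic clusters.
reviews = rng.normal(loc=0.0, scale=0.3, size=(500, 300))
articles = rng.normal(loc=1.0, scale=0.3, size=(500, 300))

docs = np.vstack([reviews, articles])          # 1,000 document embeddings
dist = cdist(docs, docs, metric="euclidean")   # 1,000 x 1,000 dissimilarity matrix

# Block-diagonal structure: within-cluster distances smaller than across.
within = dist[:500, :500].mean()
across = dist[:500, 500:].mean()
```

Plotting `dist` as a heatmap reproduces the kind of block-diagonal pattern described for Fig 2e.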

Data Labeling
We labeled 0.1% (5,500) of the reviews in order to train models that can detect four detrimental quality-of-life issues. If accurate, these models may enable domain-specific analysis of the entire corpus, especially when paired with property-level geographical and temporal metadata.

Label Selection
After consultation with multiple domain experts, we decided to focus on four issues of high interest to real estate professionals. The selected issues are often hard to identify using traditional data sources and are typically difficult and expensive to remedy. The four chosen labels were:
• Crime and violence: Have violent or severe crimes occurred at the property or very close by?
• Noise issues / thin walls: Are there constant noise issues at the property, either due to environmental or structural reasons?
• Pests / vermin: Are pests, roaches and vermin a significant and constant concern for residents?
• Parking: Are there not enough parking spaces for residents at the property and its immediate surroundings?
As a single review can contain more than one label, or none at all, this constitutes a multi-label classification problem.
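One common encoding for such a problem is a binary indicator matrix with one column per label; scikit-learn's MultiLabelBinarizer (our choice for illustration, not necessarily the study's tooling) makes this explicit.

```python
from sklearn.preprocessing import MultiLabelBinarizer

LABELS = ["crime", "noise", "pests", "parking"]

# Each review carries zero or more of the four labels (invented examples).
annotations = [
    {"pests"},           # pest issues only
    set(),               # no issues at all
    {"crime", "noise"},  # two issues in one review
]

mlb = MultiLabelBinarizer(classes=LABELS)
Y = mlb.fit_transform(annotations)  # binary indicator matrix, one column per label
```

The all-zero row corresponds to the "None" case, which, as shown in Table 2, dominates the labeled sample.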

Crowdsourcing
As accurately labeling all of the reviews manually was impractical due to the size of the corpus, we randomly sampled a subset of 5,500 reviews (0.1% of the corpus) with the intention of generating a small amount of high-quality labels. We considered labels to be high-quality when they were precisely aligned with both the detailed definitions given above and the specific positive and negative examples provided to the labelers. These labels would later be used for downstream model training and evaluation. We first conducted a series of single-label crowdsourcing experiments, each with 1,000 reviews, to refine the exact instructions provided for each label and to choose a labeling vendor. The experiments comprised multiple labeling vendors, had between three and nine labelers per review, and were conducted using the AWS GroundTruth platform. Disagreements between different label providers were assessed to detect systematic differences (Fig S1). As an example, in one pilot experiment, labelers were instructed to label reviews that mention break-ins; while labelers from one vendor interpreted this as solely apartment break-ins, other vendors also included reviews that refer to vehicle break-ins. These discrepancy comparisons enabled us to detect ambiguities in our instructions and helped refine subsequent experiments. Afterwards, we conducted multi-label pilots with the three top-performing vendors, as assessed by consensus labeling and manual review of discordantly labeled reviews in the single-label pilots, to choose the vendor with which to proceed.

Figure 2e: Sentence dissimilarity, as measured by Euclidean distance between document embeddings, between 500 randomly sampled reviews and 500 randomly sampled Wikipedia articles. Reviews are generally more similar to other reviews, and statistically different from random articles.
We next designed a two-stage crowdsourcing pipeline to ensure label quality (Fig 1). In the first stage, all 5,500 reviews were seen by three different labelers who provided an annotation for each of the four classes. 4,580 (83%) of the reviews had consensus among the three labelers in all four classes; for example, all three labelers agreed that there was no crime, no noise, there were pest issues, and there were no parking issues. To gain more confidence in the remaining 920 reviews that were not unanimously labeled, we passed them through to a second crowdsourcing stage with six additional labelers, focusing on the specific label(s) in which there was disagreement. The final label in the two-stage scenario was given by a majority vote among the nine labelers. This iterative approach was cost-effective, as reviews for which there was a consensus were pruned and more labeling resources were placed on ambiguous reviews. Table 2 shows the distribution of the crowdsourced labels, of which 88.8% were None.

Table 2: Abundance of each positive label within the set of labeled reviews. Total unique reviews: 5,500. Some reviews can have more than one label and thus the percentages sum to slightly more than 1.
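The two-stage consensus rule can be sketched in plain Python; this is a simplified illustration, and the function and example votes are ours.

```python
from collections import Counter

def two_stage_label(stage1_votes, stage2_votes=None):
    """Resolve one (review, label) pair.

    stage1_votes: 3 boolean votes from the first crowdsourcing stage.
    stage2_votes: 6 additional boolean votes, gathered only on disagreement.
    """
    if len(Counter(stage1_votes)) == 1:  # unanimous: accept immediately
        return stage1_votes[0]
    # Disagreement: escalate and take a majority vote over all nine labelers.
    all_votes = list(stage1_votes) + list(stage2_votes)
    return Counter(all_votes).most_common(1)[0][0]

# A unanimous stage-1 vote needs no second stage.
consensus = two_stage_label([True, True, True])

# A 2-1 split escalates; the majority of nine decides (here 3 yes vs. 6 no).
escalated = two_stage_label(
    [True, False, True],
    [False, False, False, False, True, False],
)
```

Because nine is odd, the escalated vote can never tie, which is what makes the majority rule well defined.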

Modeling Details
We trained the review classifier in two steps using the 5,500 labeled reviews. First, we fine-tuned a pretrained model for 10 epochs (Adam optimizer, batch size 8, learning rate 10⁻⁵). The pretrained model was either RoBERTa or DistilBERT. Each model was trained (unsupervised) on a random sample of 3M reviews that did not overlap with the 5,500 labeled reviews, using a single GPU on an AWS ml.p3.2xlarge instance. Pretrained models were based on HuggingFace implementations, and the training was done using PyTorch (Paszke et al., 2019). Second, we trained a multi-label classifier on top of the fine-tuned model on the set of 5,500 labeled reviews without freezing the encoder layers. The classifier consisted of a dense layer with a hyperbolic tangent activation function and 768 hidden units, a dropout layer (p=0.1), and another dense output layer with one output neuron for each label. We used binary cross-entropy (logit scale) as our loss function, averaged over the different labels. Model results were evaluated via 5-fold cross-validation.
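A sketch of the described classifier head in PyTorch; the RoBERTa/DistilBERT encoder is omitted, and a random tensor stands in for its pooled output.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Head as described in the text: dense(tanh) -> dropout -> dense."""
    def __init__(self, hidden_size=768, num_labels=4, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled):
        # pooled: (batch, hidden_size) sentence representation from the encoder
        x = torch.tanh(self.dense(pooled))
        x = self.dropout(x)
        return self.out(x)  # one logit per label

head = MultiLabelHead()
head.eval()  # disable dropout for a deterministic forward pass
logits = head(torch.randn(8, 768))  # random stand-in for pooled encoder output

# Binary cross-entropy on the logit scale, averaged over labels and batch.
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(8, 4))
```

Because the loss operates on logits, no sigmoid is applied inside the head; predicted probabilities are obtained with `torch.sigmoid(logits)` at inference time.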

Modeling Results
We computed the cross-validated area under the receiver operating characteristic curve (AUROC) for each of the labels to estimate model predictive accuracy. The AUROC scores stabilized for 3 out of 4 labels at around 3,000 samples, as shown via learning curves (Fig S2). Due to the sparsity of the labels, there was variability between folds, with fine-tuning improving both the average and the variance across folds. The plateauing AUROC suggested diminishing returns from obtaining additional labeled reviews. Finally, the neural models had a strong tendency to overfit the train set, as observed by fitting the models to permuted labels, stressing the importance of cross-validation in performance estimation (Fig S3). Interestingly, despite the fact that the model was trained on binary labels (chosen via majority voting between labelers), model predictions were highly correlated with labeler uncertainty (Fig 3). This suggests that the model's predicted probabilities may be used to learn the inherent ambiguity in label definitions.
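The cross-validated AUROC computation can be sketched with scikit-learn on synthetic stand-in data; the label sparsity and score noise below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-ins: sparse binary labels (~5% positive, mimicking one of
# the four label columns) and informative-but-noisy model scores.
y = (rng.random(5500) < 0.05).astype(int)
scores = y * 0.3 + rng.random(5500)

# Per-fold AUROC on held-out indices, as in 5-fold cross-validation.
aucs = []
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(scores):
    aucs.append(roc_auc_score(y[test_idx], scores[test_idx]))
mean_auc = float(np.mean(aucs))
```

In the real pipeline the scores would come from the classifier's held-out predictions rather than a synthetic generator.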
In Table 3, we provide the AUROC, as well as average precision and F1 score, for different models trained on our labeled dataset. Numbers represent the average cross-validated scores using the probabilistic, not thresholded, predictions, except for F1, for which we chose the optimal threshold (separately for each model and label). Fine-tuned models outperformed the base model for both DistilBERT and RoBERTa, and were also better calibrated, as evident by the Brier score (see Table S1). As baselines, we also provide comparisons to fastText, an efficient C++ implementation of a Bag-of-Words-based classification algorithm (Joulin et al., 2016), and to a BERT-based ABSA classification model³. The latter model is composed of a HuggingFace implementation of a BERT model, pretrained on SemEval 2014, Task 4 (Pontiki et al., 2014a), a subsequent dropout layer, and a dense classification layer, and was not post-trained on the crowdsourced labels. Negative sentiment was evaluated on four aspects corresponding to the labels "crime", "noise", "pests", and "parking", and serves as a benchmark for the performance of an unsupervised approach.

Figure 3: Model predicted probabilities match labeler uncertainty. The mean predicted probability (logit scale) is plotted against the ratio of labeler disagreement, as defined by the fraction of positive labels (ranging from 0/9 to 9/9), averaged over all 5,500 reviews and 4 labels. 0 or 1 on the x-axis indicates full agreement among labelers.
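The per-label optimal-threshold selection for F1 can be sketched as follows; this is a hypothetical implementation, as the study does not specify how the threshold search was performed.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the decision threshold that maximizes F1 for one label."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the sentinel.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(
        precision[:-1] + recall[:-1], 1e-12, None
    )
    return thresholds[np.argmax(f1)], f1.max()

# Invented scores for one label; perfectly separable between 0.35 and 0.4.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.8, 0.3, 0.9, 0.15])
thr, f1 = best_f1_threshold(y_true, scores)
```

Because the four labels differ in prevalence, running this search separately per model and label, as the text describes, yields different operating points for each.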
Fine-tuning the pretrained base models improved results across all four labels, both for DistilBERT and RoBERTa. This suggests the presence of differences in statistical properties between our corpus and the concatenation of Wikipedia and the Toronto Book Corpus, on which both DistilBERT and RoBERTa were trained. In contrast, there was no substantial difference in results between fine-tuned RoBERTa and fine-tuned DistilBERT when considering all labels.
We conducted error analysis by manual examination of the subset of the 5,500 labeled reviews with the highest disagreement between model output scores and labeler annotations. For each label, we investigated the 10 highest model output scores in which the annotation was negative and the 10 lowest model scores with positive annotations. We found no systematic bias among these reviews, and generally agreed with the labels given by human  annotators, especially for reviews with positive annotations.
After verifying the accuracy of the model, we proceeded to use the RoBERTa fine-tuned model to predict the labels of all 5.5M reviews. This created what is, to the best of our knowledge, the largest labeled reviews dataset in the field of commercial real estate.

Association of Model Predictions with Property and Demographic Data
Model predictions on the review corpus, together with review metadata, enabled us to analyze nationwide multifamily housing trends from a tenant perspective. Below are select examples that demonstrate associations between automatically identified issues in reviews and property-level or geographic-level data. One natural question was to what extent model scores correlated with established property quality metrics. One commonly used metric is asset grade, which ranges from A (best) to D (worst) and reflects where the property falls across the quality spectrum relative to its U.S. Census-defined geographic area (source: Axiometrics). We computed the mean scores per asset grade for all properties for which an asset grade was obtainable (23,912 properties). Higher-grade properties were found to have fewer crime and pest issues in their reviews, as expected (Fig 4a). In contrast, no strong association existed between noise or parking scores and asset grade. A similar behavior was observed when comparing model scores to property expense ratios, i.e., the ratio of operating expenses to gross revenue (sources: Fannie Mae and Freddie Mac) (Fig 4b).
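The per-grade aggregation reduces to a groupby-mean over property-level scores; the sketch below uses pandas on invented grades and scores.

```python
import pandas as pd

# Hypothetical per-property data: model crime scores joined with asset grades.
df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "crime_score": [0.02, 0.04, 0.05, 0.07, 0.10, 0.12, 0.18, 0.22],
})

# Mean predicted crime score per asset grade, ordered best to worst.
by_grade = df.groupby("grade")["crime_score"].mean().reindex(["A", "B", "C", "D"])
```

Plotting `by_grade` per label reproduces the kind of comparison shown in Fig 4a.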
We additionally investigated whether the tenant reviews reveal geotemporal trends in the data. We compared predicted review scores against the year built of each property in our dataset, as newer properties are typically of higher quality. The analysis was conducted for 64,810 properties that were built after 1970 (sources: multiple). Indeed, we found that newer properties had fewer issues across all labels; however, the improvement only commenced in the past decade for noise and parking issues, in contrast to crime and pest problems (Fig 4c). Spatially, we compared per-city average crime scores from the reviews (mean predicted crime score across all the reviews from 2015-2017 for properties in a given city) against nationwide public FBI crime reports from 2017⁴, which are at the city level. The FBI report covered 4 different types of violent crimes and 4 different types of property-specific crimes, and there was a strong positive correlation between levels of various crime categories across cities (mean Pearson correlation between different crime types of 0.6). Fig 4d shows an example for a single crime category, motor vehicle theft.
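The city-level comparison reduces to a Pearson correlation between two per-city series; the numbers below are invented stand-ins for the review-derived scores and FBI-reported rates.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-city aggregates: mean predicted crime score from reviews
# versus an FBI-reported crime rate for the same six cities.
review_crime_score = np.array([0.05, 0.08, 0.12, 0.15, 0.22, 0.30])
fbi_crime_rate = np.array([120, 180, 260, 270, 400, 520])  # per 100k residents

r, p_value = pearsonr(review_crime_score, fbi_crime_rate)
```

With real data, one such correlation would be computed per crime category, as in the Fig 4d example for motor vehicle theft.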

Discussion
In this study, we applied NLP techniques to investigate a unique dataset of millions of online tenant reviews. We demonstrated that tenant reviews have idiosyncratic textual and statistical properties differentiating them from other commonly used textual datasets. We further presented a resource-effective multi-labeling approach and showed that a limited set of high-quality labels can achieve excellent results in a previously little-studied domain. Finally, we illustrated that NLP-based scores are informative, as verified by domain-specific validations, and can be used to study financial, demographic, geographical, and temporal trends in a quantitative way.
Our work is in line with prior observations that with a relatively small number of labels, fine-tuned language models can be trained to accurately predict human annotations in novel corpora (Yu et al., 2018). Although we focused on four key labels of interest, we expect this approach will generalize to other informative labels such as maintenance issues, management-related concerns, and renovation needs. Additionally, while our analysis bears similarity to aspect-based sentiment analysis (Xu et al., 2019), the class definitions used are more precise. For example, a review that mentions a single event of a pest sighting in a property might demonstrate a negative sentiment towards pests, but is not necessarily indicative of a recurrent problem in the property as we defined in labeling instructions.
Domain-specific validation serves as an orthogonal means of assessing model usefulness. Encouragingly, model predictions often correlated with prior domain knowledge: crime and pest issues were more prevalent in lower-grade properties, all four labels improved in newer properties, and cities with higher crime rates had more crime-related reviews. These serve as secondary validations that strengthen our conviction in the value of model predictions.
Our results reveal differences between crime and pest issues versus parking and noise issues in relation to external, non-review data. Whether this is an artifact, for instance due to the latter two being sparser labels, or a true real estate phenomenon warrants further investigation. One potential explanation may be variation in the tenant base. For example, tenants in grade A properties may be more sensitive to noise and parking issues, and thus lower noise levels may still receive increased mention. Construction-wise, the evolution of building standards may be associated with the differences in pest, noise, and parking issue mentions in newer buildings. Finally, demographic changes may also be linked to the strong reduction in crime mentions in newer buildings.
One concern when analyzing online reviews is the potential presence of fake or solicited reviews.
Non-authentic reviews can bias the average score of a given property, in turn compromising the accuracy of downstream inferences. While online review sites have made large efforts to ensure review authenticity, there is nonetheless a risk. Initial results indicate that NLP-based analysis might help in identifying these reviews (Abri et al., 2020); applying this to our dataset and investigating the sensitivity of the results to such preprocessing is a potentially exciting future direction.

Conclusion
The use of AI and non-traditional data in commercial real estate is expected to have far-reaching implications. Our work contributes to this broader scope by highlighting how online tenant reviews, which have become ubiquitous, can uncover valuable insights that support both real estate investment decisions and research into local and nationwide housing trends.

A Assessing vendor disagreement
Figure S1: Assessing vendor disagreement. An example of vendor comparison for a single pilot crowdsourcing experiment with 1000 reviews. The final label for each review was chosen using a majority vote between the labelers. In the case of a tie among 3 labelers the final label was set as "Not sure" (the case of 1 "Yes", 1 "No" and 1 "Not sure"). Manual analysis of vendor differences focused on reviews that were majority-labeled as "Yes" by one vendor and "No" by the other vendor, which in this experiment was 29 and 31 reviews (top right and bottom left in the figure).

B Learning curves

Figure S2: Diminishing effect of increasing train set size (learning curves). We trained the fine-tuned RoBERTa text classification model using increasing amounts of training examples (from 150 to 4,400), while keeping the test set size fixed at 1,100 and using the same test reviews in each case. 5-fold CV was used for evaluation (220 test samples per fold). The filled area represents the standard deviation over 5 folds. The black dots and gray area represent means and standard deviations for the non-fine-tuned model. While the variability between folds is large, likely due to test set size, the benefit of increasing the train set size beyond 3,000 samples appears small for 3 out of 4 labels (results for "Parking" were too noisy to infer this).
C ROC curves on permuted labels

Figure S3: ROC curves on permuted labels. We trained the fine-tuned RoBERTa text classification model for 5 epochs (all other parameters are as described in the main text) on permuted labels (each label was permuted differently). Red lines correspond to ROC curves on the training set (for 5 different folds), black lines to the test set. The model shows significant overfitting to the train set already after 5 epochs.

Table S1: Brier loss per model (0.014, 0.018, 0.016, 0.012). Loss is averaged across 5 folds (see main text).