WHOSe Heritage: Classification of UNESCO World Heritage Statements of "Outstanding Universal Value" with Soft Labels

The UNESCO World Heritage List (WHL) includes the exceptionally valuable cultural and natural heritage to be preserved for mankind. Evaluating and justifying the Outstanding Universal Value (OUV) is essential for every site inscribed in the WHL, yet a complex task even for experts, since the selection criteria of OUV are not mutually exclusive. Furthermore, manual annotation of heritage values and attributes from multi-source textual data, which currently dominates heritage studies, is knowledge-demanding and time-consuming, impeding systematic analysis of such authoritative documents in terms of their implications for heritage management. This study applies state-of-the-art NLP models to build a classifier on a new dataset containing Statements of OUV, seeking an explainable and scalable automation tool to facilitate the nomination, evaluation, research, and monitoring processes of World Heritage sites. Label smoothing is innovatively adapted to improve model performance by adding prior inter-class relationship knowledge to generate soft labels. The study shows that the best models fine-tuned from BERT and ULMFiT can reach 94.3% top-3 accuracy. A human study with expert evaluation of the model predictions shows that the models are sufficiently generalizable. The study is promising to be further developed and applied in heritage research and practice.


Introduction
Since the World Heritage Convention was adopted in 1972, 1121 sites have been inscribed worldwide in the World Heritage List (WHL) up to 2019, aiming at a collective protection of the cultural and natural heritage of Outstanding Universal Value (OUV) for mankind as a whole (UNESCO, 1972; von Droste, 2011; Pereira Roders and van Oers, 2011). First proposed in 1976, OUV, meaning the "cultural and/or natural significance which is so exceptional as to transcend national boundaries and to be of common importance for present and future generations of all humanity", has been operationalized and formalized into an administrative requirement for new inscriptions on the WHL since 2005 (UNESCO, 2008; Jokilehto, 2006, 2008). All nominations must meet one or more of the ten selection criteria (six for culture and four for nature), focusing on different cultural and natural values.
Since 2007, complete Statements of OUV (SOUV) need to be submitted and approved for new World Heritage (WH) nominations, which should include, among others, a section "justification for criteria", giving a short paragraph explaining why a site (also known as a property) satisfies each of the criteria it is inscribed under. These statements are drafted by the States Parties after scientific research for any tentative nominations, further reviewed and revised by the Advisory Bodies from ICOMOS and/or IUCN, and eventually approved and adopted by the World Heritage Committee for inscription. Similarly, Retrospective SOUV have been required for sites inscribed before 2006 to revise or refill the justification of criteria section (IUCN et al., 2010). However, the evaluation of SOUV can be ambiguous in the sense that: 1) the selection criteria are not mutually exclusive and contain common information about historical and aesthetic/artistic values as an integral part (Jokilehto, 2008); 2) the key stakeholders evaluating the SOUV for a nomination occasionally disagree with each other at early stages, leading to recursive reviews and revisions, though all are considered to be domain experts (Jokilehto, 2008; Tarrafa Silva and Pereira Roders, 2010; von Droste, 2011). A tool to check the accuracy, objectivity, consistency, and coherence of such statements could significantly benefit the inscription process, which involves thousands of experts worldwide each year.
The SOUV are essential reference points not only for new nominations, but also for monitoring and interpreting inscribed heritage sites (IUCN et al., 2010). Researchers and practitioners actively and regularly check whether the justified criteria are still relevant for the sites, so as to decide on further planning and managerial actions. Moreover, these same statements are also used in support of legal court cases, should WH sites be endangered by human development (Pereira Roders, 2010; von Droste, 2011). Supported by the Recommendation on the Historic Urban Landscape and the recent Our World Heritage campaign, multiple data sources (e.g., news articles, policy documents, social media posts) are encouraged in such analyses for identifying and mapping OUV (UNESCO, 2011; Bandarin and van Oers, 2012; Ginzarly et al., 2019). The traditional method of manually annotating heritage values and attributes by experts, albeit dominant in practice, is too time-consuming and knowledge-demanding for analysing the massive numbers of social media posts by people in cities with WH sites to find OUV-related statements (Tarrafa Silva and Pereira Roders, 2012; Abdel Tawab, 2019; Tarrafa Silva and Pereira Roders, 2010).
To approach both ultimate goals of this study, i.e., 1) aiding the inscription process by checking the coherence and consistency of SOUV, and 2) identifying heritage values from multiple data sources (e.g., social media posts), a computational solution rooted in SOUV is desired. By training NLP models on the officially written and approved SOUV, a machine replica of the collective authoritarian view can be obtained. This machine replica will not be employed at this stage to justify OUV for new nominations from scratch. Rather, it will assess the written SOUV of WH sites (either existing or new) and classify OUV-related texts with the learned collective authoritarian view. Furthermore, it can investigate the existing SOUV from the bottom up and capture the subtle intrinsic associations within the statements and among the corresponding selection criteria (Bai et al., 2021a). This yields a new perspective on interpreting the WHL, which could give insights for further amending the concept of OUV and the selection criteria to make them better discernible.
Therefore, this study aims at training an explainable and scalable classifier that can reveal the intrinsic associations of the World Heritage OUV selection criteria and that is feasible to apply in real-world analyses by researchers and practitioners. As an outcome, this paper presents the classifier of UNESCO World Heritage Statements of OUV with Soft Labels (WHOSe Heritage).
The contributions of this paper can be summarized as follows: 1) a novel text classification dataset is presented, concerning a domain-specific task about Outstanding Universal Value for UNESCO World Heritage sites; 2) innovative variants of label smoothing are applied to introduce the prior knowledge of label association into training as soft labels, which proved effective in improving the performance of most of the popular baseline models investigated in this task; 3) several classifiers are trained and compared on the Statements of OUV classification task as initial benchmarks, supplemented with explorations of their explainability and generalizability using expert evaluation.

Related Work
Text classification In the past decades, numerous models have been proposed for text classification tasks, ranging from shallow to deep learning models. In shallow learning models, the raw input text is pre-processed to extract features, which are then fed into machine learning classifiers, e.g., Naive Bayes (Maron, 1961) and support vector machines (Joachims, 1998), for prediction. In deep learning models, deep neural networks are leveraged to extract information from the input data, such as convolutional neural networks (CNN) (Kim, 2014; Johnson and Zhang, 2017), recurrent neural networks (RNN) (Tai et al., 2015; Cho et al., 2014), attention networks (Yang et al., 2016), and Transformers (Devlin et al., 2019). Multi-class and multi-label tasks are two extensions of the simplest binary classification, where every sample can belong to one or more classes within a class list (Aly, 2005; Tsoumakas and Katakis, 2007) and the labels may also be correlated (Pal et al., 2020). This work explores the combined application of some popular shallow and deep learning models for a multi-class classification task.
Label Smoothing Label smoothing (LS) was originally proposed as a regularization technique to alleviate overfitting in training deep neural networks (Szegedy et al., 2016; Müller et al., 2019). It assigns a noise distribution over all the labels to prevent the model from predicting too confidently on 'ground-truth' labels. It is widely used in computer vision (Szegedy et al., 2016), speech (Chorowski and Jaitly, 2017), and natural language processing (Vaswani et al., 2017) tasks. Originally the distribution is uniform across the labels, which is data-independent. Recently, other variants of LS have been proposed that incorporate the interrelation information from the data into the distribution (Zhong et al., 2016; Zhang et al., 2021; Krothapalli and Abbott, 2020). In this work, the technique is applied to generate soft labels with a distribution derived from domain knowledge, since the classes in this task are clearly interrelated with each other.
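As a reference point for the variants discussed later in this paper, the original data-independent LS can be sketched in a few lines (a minimal illustration, not code from the paper):

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Original (uniform, data-independent) label smoothing: mix the
    one-hot target with a uniform distribution over all K classes."""
    k = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / k

# e.g., a 4-class one-hot target with alpha = 0.2
y = np.array([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(y, 0.2))  # [0.05 0.85 0.05 0.05]
```

With alpha = 0 the hard one-hot target is recovered, while larger alpha values spread more probability mass onto the non-target classes, discouraging overconfident predictions.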
Transfer Learning in NLP In many real-world applications, labelled data are limited and expensive to collect, and training models from scratch with limited data hurts performance. Transfer learning (Pan and Yang, 2010) is widely used to address this, by using word embeddings that are pretrained on massive corpora and fine-tuning them on the target task. Earlier works (Mikolov et al., 2013; Pennington et al., 2014) provide static word embeddings that ignore the contextual information in the sentences. More recent works, e.g., ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2019), take the context into account and generate dynamic contextualized word vectors, showing excellent performance and proving sufficiently generalizable across many tasks. This task, with a relatively small data size, employs the idea of transfer learning and applies both embedding methods.
Data and Problem Statement

Data Collection and Pre-processing

UNESCO World Heritage Centre openly releases a syndication dataset of the sites in XLS format2, which includes information on the inscribed World Heritage sites such as ID, name, short description, justification of criteria, etc. Among these fields, justification provides a paragraph for each selection criterion the site fulfills3, contributing the input data for this task. In total, 1052 out of 1121 WH sites contain the justification data4, while the remaining 69 await their Retrospective SOUV to be approved, as introduced in Section 1. As an example, in Venice and Its Lagoon, the paragraph on criterion (i) shows:

2 http://whc.unesco.org/en/syndication. Copyright © 1992-2021 UNESCO/World Heritage Centre. All rights reserved.
3 This field is not complete in the original XLS dataset. The WHC website is walked through to fill in the missing values.
4 The statistics are up to the 44th session of the World Heritage Committee held in Fuzhou, China in July 2021, after which the total number of WH sites grew to 1154.
...The lagoon of Venice also has one of the highest concentrations of masterpieces in the world: from Torcello's Cathedral to the church of Santa Maria della Salute. The years of the Republic's extraordinary Golden Age are represented by monuments of incomparable beauty...5

Any inscribed WH site p_i ∈ P, where P is the set of all the sites, may fulfill one or more of the ten selection criteria. By checking whether each criterion k is justified for the site p_i, a non-negative vector γ_i = [γ_{i,1}, ..., γ_{i,κ}], κ = 10, can be formed as the "parental" label for the site:

γ_{i,k} = 1 if p_i is justified under criterion k, and γ_{i,k} = 0 otherwise.  (1)

Meanwhile, the paragraphs X_i in the justification field of p_i, describing all criteria that p_i has, are split into sentences. For the j-th sentence x_{i,j,k} describing the criterion k possessed by the site p_i, a non-negative one-hot vector y_{i,j,k} can be formed as the "ground-truth" label for this single sentence:

y_{i,j,k,l} = 1 if l = k, and y_{i,j,k,l} = 0 otherwise, for l = 1, ..., κ.  (2)

Each sentence x_{i,j,k} ∈ X_i is treated as a sample with two labels: a one-hot "ground-truth" label y_{i,j,k} for the particular sentence, and a multi-class "parental" label γ_i shared by all sentences that belong to the site p_i. The sentence-level setup is desirable here since paragraphs may contain overwhelming information on multiple OUV criteria, as will be shown in Section 3.2. As such, a more specific indication of OUV tendencies in each part of the texts can be differentiated. Complementarily, the fine-grained sentence-level prediction vectors can still be aggregated to paragraph/text level without losing lower-level details, as will be demonstrated in Figure 2. As the sentences were written, revised, and approved by various domain experts at local and global levels during the inscription process, the labels can be considered as having a good "inter-annotator agreement" (Jokilehto, 2008; Nowak and Rüger, 2010).
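The two label vectors can be sketched as follows (a minimal illustration with hypothetical variable names; indices 0-5 stand for the cultural criteria (i)-(vi) and 6-9 for the natural criteria (vii)-(x)):

```python
import numpy as np

KAPPA = 10  # the ten OUV selection criteria: C1-C6 (cultural) and N7-N10 (natural)

def parental_label(justified):
    """Multi-hot 'parental' vector gamma_i over the criteria a site is justified under."""
    gamma = np.zeros(KAPPA)
    gamma[list(justified)] = 1.0
    return gamma

def sentence_label(criterion):
    """One-hot 'ground-truth' vector y_ijk for a sentence describing one criterion."""
    y = np.zeros(KAPPA)
    y[criterion] = 1.0
    return y

# Venice is justified under (at least) criteria (i)-(iv), i.e., indices 0-3
gamma_venice = parental_label({0, 1, 2, 3})
y_sentence = sentence_label(0)  # a sentence written to justify criterion (i)
```

Every sentence sample thus carries its own one-hot label together with the multi-hot parental label of its site.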
The following data pre-processing techniques are applied to construct the final dataset used for training: 1) all letters are turned into lower-case; 2) the umlauts and accents are normalized; 3) numbers are replaced with a special <NUM> token; 4) only sentences with a length between 8 and 64 words are kept, based on the dataset distribution; 5) the sentences are randomly split into train/validation/test sets with a proportion of 8:1:1.

Table 1: The number of samples at sentence level that contain each criterion as a label, annotated with C1 to C6 for cultural values and N7 to N10 for natural values. The first three rows show the data split using the field justification; the fourth row shows a new dataset only for testing using the field short description (SD); the last row shows the potential samples the models can see for each criterion after introducing label smoothing (LS).

Additionally, the official definition sentences of the selection criteria6 as given in Table 4 of Appendix A are respectively appended to the train split with the same one-hot sentence and parental labels for each criterion. Stop-words are not removed since BERT and ULMFiT, to be applied later, generally prefer natural texts with context information. Furthermore, an additional 11th class "Others" is introduced by appending an arbitrary noise of γ_{i,κ+1} = 0.2 to all parental labels γ_i, and a 0 to all "ground-truth" labels y_{i,j,k}, so that the models are not forced to give predictions only on the ten criteria even when the relevance to all of them is weak. For each sentence, the 11th "Others" class and the complement set of its parental labels can be regarded as the negative classes for classification, since the site this sentence describes is not justified with those values. An exemplary pre-processed data sample is shown in Table 6 in Appendix A.
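Pre-processing steps 1) to 4) above can be sketched as follows (a minimal illustration; the exact tokenization and accent handling in the paper may differ):

```python
import re
import unicodedata

def preprocess(sentence):
    """Sketch of pre-processing steps 1-4: lower-case, normalize accents,
    mask numbers with <NUM>, and keep only sentences of 8-64 words
    (returns None for sentences outside that range)."""
    s = sentence.lower()
    # normalize umlauts/accents towards plain ASCII
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # replace any digit run with the special token
    s = re.sub(r"\d+", "<NUM>", s)
    n_words = len(s.split())
    return s if 8 <= n_words <= 64 else None

print(preprocess("The Republic's Golden Age began around 1204 in the Venetian Lagoon area."))
```

Sentences falling outside the 8-64 word window are dropped, matching the length filter described above.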
On average, 27.97 ± 11.04 words appear in each sentence. A summary of the number of samples at sentence level in each split for each criterion is presented in the first three rows of Table 1.
Similarly, the paragraphs S_i in the field short description of WH site p_i, which give a general introduction of the site and were not originally written to describe any specific OUV selection criterion, are pre-processed into an additional independent test dataset SD to evaluate the generalizability of the classifiers on unseen data coming from a slightly different distribution. For those sentences s_{i,o} ∈ S_i, both the ground-truth and parental labels are set to γ_i for the site they describe. The total number of samples that contain each criterion in the SD dataset is shown in the fourth row of Table 1.

Association between Classes
Jokilehto (2008) summarized the selection criteria with their main focuses by inspecting the official definitions6 and the justification texts of WH sites. Details about the definitions of the criteria can be found in Appendix A. However, as stated in Section 1, the criteria are not mutually exclusive. The criterion (i) justification of Venice in Section 3.1 will again be used as an example. Judging as a domain expert, it clearly describes criterion (i) as labelled, since it explicitly uses the terms "masterpieces" and "monuments of incomparable beauty". However, traces of other values can still be found: 1) as it describes the "Cathedral", "church", and "monuments", it also concerns criterion (iv) about architectural typology; 2) as it talks about the "Golden Age", it also points to criterion (ii) about influence and criterion (iii) about testimony. In fact, Venice is also justified with criteria (ii), (iii), and (iv). Pragmatically speaking, for sites fulfilling more than one OUV selection criterion, it is hard to avoid talking about the other criteria while isolating one criterion alone (Pereira Roders, 2010).

6 http://whc.unesco.org/en/criteria
Furthermore, the association between each pair of criteria can be different. The distinction between criteria is generally larger when the pair comes from different categories (cultural vs. natural). For a pair of criteria from the same category, the association level can also vary. For example, Jokilehto (2008) pointed out that "criteria (i) and (ii) can reinforce each other while (iv) is often used as an alternative". This complex association pattern can also be seen in the co-occurrence matrix A_{κ×κ} of criteria over all the inscribed sites P, where the off-diagonal entries A_{k,l} count the sites justified under both criteria k and l, and the diagonal entries A_{k,k} record the number of cases when each criterion is used alone (shown in Figure 4 of Appendix A). This intrinsic association is to be used as the prior knowledge for the classification task.
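The co-occurrence matrix can be computed from the sites' justified criteria sets, for instance as below (a sketch; the counting convention for the diagonal follows the description above, and the toy input is hypothetical):

```python
import numpy as np

KAPPA = 10

def cooccurrence(sites):
    """Build the kappa x kappa co-occurrence matrix A over inscribed sites:
    A[k, l] (k != l) counts sites justified under both criteria k and l,
    while the diagonal A[k, k] counts sites where criterion k is used alone."""
    A = np.zeros((KAPPA, KAPPA), dtype=int)
    for crits in sites:
        if len(crits) == 1:
            (k,) = crits
            A[k, k] += 1
        else:
            for k in crits:
                for l in crits:
                    if k != l:
                        A[k, l] += 1
    return A

# toy example: one site with criteria (i)+(ii), one with (viii) alone
A = cooccurrence([{0, 1}, {7}])
```

Each row A[k] then serves as the criterion-specific association vector used later for the prior label-smoothing variant.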

Models and Experiments
Soft Labels Generation

Section 3.2 argues that the selection criteria are not mutually exclusive, and that co-justified criteria of a WH site with a stronger association may be reflected in the sentences describing a specific criterion. In other words, classifying such sentences is not purely a single-label multi-class classification task. Rather, it also has a multi-label characteristic considering the "parental labels" of the sites.
To mediate between the two sorts of tasks and to prevent the models from being overconfident in the only "ground-truth" labels, this paper proposes to apply the label smoothing (LS) technique with two novel variants, combining the "ground-truth" sentence label y_{i,j,k} and the parental document label γ_i into a single vector ỹ_{i,j,k} as soft labels for the training process. This is similar to the hierarchical LS approach proposed by Zhong et al. (2016) to reflect the prior label similarity distribution. We propose three variants: vanilla, which assigns identical "noise" to all classes and will be proved equivalent to the original LS in Appendix B; uniform, which treats all co-justified criteria in the parental label equally; and prior, which weights the co-justified criteria by the frequency with which the pair co-occurs in the matrix A_{κ×κ}:

ỹ_{i,j,k} = f(y_{i,j,k} + α · γ_i ⊙ a_k).  (4)

Here f : R_+^d → [0, 1]^d is a variant of the original softmax function that maps a d-dimensional vector of non-negative real numbers to a distribution summing up to 1:

f(z)_d = (exp(z_d) − 1) / Σ_{d'} (exp(z_{d'}) − 1),  (5)

a_k is a criterion-specific non-negative vector showing the inter-criteria associations (for prior, derived from the k-th row of A_{κ×κ}), and ⊙ represents the element-wise Hadamard-Schur product of vectors. The variant of the softmax function introduced in Equation 5 is preferable since it transforms the combined non-negative label vectors in Equation 4 into a "probability" distribution while keeping non-related labels still at 0, because exp(0) − 1 = 0. For example, a combined vector [2, 0, 1, 0]^T becomes [.62, .08, .22, .08]^T with the normal softmax, but [.79, 0, .21, 0]^T with this variant.
All three variants are considered as options during training and tuned as hyperparameters together with the scalar α ∈ {0, 0.01, 0.05, 0.1, 0.2, 0.5, 1}. For all variants, the problem is purely multi-class when α = 0, and approaches multi-label as α grows, giving the parental labels larger weights.
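The soft label generation can be sketched as follows. Implementing f as a normalized exp(z) − 1 is an assumption of this sketch, chosen because it keeps zero entries at zero and reproduces the worked example [2, 0, 1, 0] → [.79, 0, .21, 0]; the per-variant noise vectors follow the descriptions above:

```python
import numpy as np

def f(z):
    """Softmax variant mapping a non-negative vector to a distribution
    while keeping zero entries at exactly 0, since exp(0) - 1 = 0."""
    e = np.exp(z) - 1.0
    return e / e.sum()

def soft_label(y, gamma, alpha, variant="uniform", a_k=None):
    """Combine the one-hot sentence label y with the parental label gamma
    into a single soft target, following the three proposed variants."""
    if variant == "vanilla":    # identical noise on all classes (original LS)
        noise = np.ones_like(y)
    elif variant == "uniform":  # all co-justified criteria weighted equally
        noise = gamma
    else:                       # "prior": weight by the co-occurrence vector a_k
        noise = gamma * a_k
    return f(y + alpha * noise)

# reproduces the worked example from the text: [2, 0, 1, 0] -> [.79, 0, .21, 0]
print(np.round(f(np.array([2.0, 0.0, 1.0, 0.0])), 2))
```

With α = 0 the soft label collapses back to the one-hot sentence label, matching the purely multi-class case described above.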
The following benefits can be achieved with the proposed LS variants: 1) the knowledge of the actual association of classes (selection criteria) is introduced into training in both the uniform and prior variants, giving the model a chance to learn these intrinsic associations from the soft labels; 2) freedom in the design decision of whether the problem should be multi-class or multi-label is provided to the model training process; 3) the models can potentially see more instances for each class during training with the LS variants, as shown in the last row of Table 1; 4) the computed soft label vector ỹ_{i,j,k} is mathematically more similar to the prediction vector than a one-hot vector, both being discrete "probability" distributions, pushing the use of the Cross-entropy Loss closer to its original definition (Rubinstein and Kroese, 2013).

Metrics
For the training process, Cross-Entropy between the soft label vector and the prediction vector is used as the loss function, while three metrics are used to evaluate the model performance as a multi-class classification task: 1) Top-1 Accuracy, which counts the instances when the predicted class with the highest output value matches the ground-truth sentence label; 2) Top-k Accuracy, which counts the instances when the ground-truth sentence label is among the top k predicted classes with the highest output values; 3) Macro-averaged F1, which calculates the overall cross-label performance. Per-class metrics (i.e., top-1 precision, recall, and F1) for each selection criterion are also calculated for evaluation purposes.
For the independent SD test set, two metrics are defined here to evaluate the model performance as a multi-label classification task: 1) Top-1 Match, which counts the instances when at least one of the parental labels matches the predicted class; 2) Top-k Match, which counts the instances when at least one parental label is among the top k predicted classes. Arguably, the top-1 and top-k matches are more tolerant extensions of top-1 and top-k accuracy to multi-label classification scenarios.
For all evaluation metrics, k is chosen to be 3 following the rationale introduced in Appendix A.

Experiment Setup
The experiment consists of successive steps for each baseline (details given in Appendix C): 1) grid search within a small range is performed to tune the hyperparameters with a single random seed, and the best configuration is selected according to the top-k accuracy on the validation split; 2) LS with different α values under all three variants (vanilla, uniform, and prior) is tested using the configuration from step 1, repeated with 10 different random seeds and treated as another round of hyperparameter tuning, saving the best LS configuration according to the performance mean and variance over the seeds.

Results

The uniform variant of LS with some α value appears in the best configurations of most models. A possible explanation is that uniform LS introduces the prior knowledge from the parental labels as "noise" in a simple way during training, balancing yet not challenging the "ground-truth" sentence labels (Müller et al., 2019). Still, the complex effect of LS on different baselines invites further investigation.
Table 2 shows the performance of the models with and without LS on the validation split, the test split, and the SD test set. Except for BoE, introducing LS increased the performance of most baselines on most metrics. Generally speaking, the pretrained models dominate the performance, and the highest score for every metric occurs with either ULMFiT or BERT, mostly with LS. Still, top-1 accuracy only reaches 71% in the best models, while top-k accuracy manages to reach 94%, suggesting that it would be more reliable to look at the top 3 predictions during application in this task. The models perform remarkably well on the SD test set, though given a relatively simpler task than in training, indicating the generalizability of the classifiers.
The per-class top-1 metrics of the best models in each baseline on the validation and test splits (Table 3) make it evident that the difficulty of classifying each selection criterion varies. A t-test shows that the F1 score is significantly different between the cultural and natural criteria (t = 8.20, p < .001), suggesting that the natural criteria are probably more clearly defined, while the cultural ones might be closely intertwined. The poor performance on criterion (v) is consistent with its smallest sample size (as shown in Table 1); meanwhile, the models perform reasonably well for criterion (viii) with the second smallest sample size. This suggests that besides sample size, the strong associations between the classes can also influence the difficulty for NLP models (and probably also for human experts) to distinguish the nuances of the criteria. Criterion (i) has a far poorer precision than recall, suggesting that samples from other criteria, especially from criterion (iv) based on the confusion matrices shown in Figure 5 of Appendix D, are easily mistaken for this one. This is also comprehensible since criterion (i), emphasizing that a site is a masterpiece, can easily be mentioned "unintentionally" in the description of criterion (iv), which regards the value of a specific architectural typology.

Error Analysis and Explainability
Although sometimes challenged (Serrano and Smith, 2020), attention mechanisms are believed to be effective for visualizing NLP model performance in an explainable manner (Yang et al., 2016; Vaswani et al., 2017; Tang et al., 2019; Sun and Lu, 2020). The same example on OUV selection criterion (i) in Venice as in Sections 3.1 and 3.2 is demonstrated here using the trained models from the attention-enabled GRU+Attn and BERT, as shown in Figure 2, with the help of the BertViz library (Vig, 2019; Vaswani et al., 2018). GRU+Attn employs a single universal attention mechanism over all inputs, while BERT has 12 attention heads for the [CLS] token on its last layer; both manage to capture meaningful keywords and phrases such as masterpiece, church, golden age, monuments, and incomparable beauty in the sentences. As a note, Clark et al. (2019) used probing to find that some BERT attention heads correspond to certain linguistic phenomena. In this study, the attention heads from the last layer also seem to focus on different semantic information of OUV. This observation invites further studies.
Figure 2 also shows the top-3 predictions of the models on the exemplary sentences. In the overall predictions taking the sentences as a paragraph for input, all models manage to give the ground-truth label criterion (i) the highest predicted value (from 0.32 in N-gram to 0.85 in BERT). Remarkably, all models also include criterion (iv) in the top-3 predictions (from 0.05 in GRU+Attn to 0.17 in N-gram), suggesting that the sentences might also be related to criterion (iv). The fine-grained predictions taking each sub-sentence as input, however, show a different pattern. Although criterion (i) is almost always present in the top-3 predictions, criterion (iv) takes a higher place in the second sentence for GRU+Attn, and in the third sentence for BERT. This behaviour is not necessarily an error per se. Rather, considering the arguments in Section 3.2, those sub-sentences could indeed be relevant to other criteria (in this case, criterion (iv)) based on the association pattern, q.v. Bai et al. (2021a), indicating why criterion (iv) is always included in the overall predictions.

Expert Evaluation
Eight heritage researchers with rich experience in identifying heritage values and attributes were invited for a human study adapted from He et al. (2021), Nguyen (2018), and Schuff (2020) to test the models' reliability and generalizability. They were presented with 56 sentences about Venice harvested from the "Justification" (14) and "Brief Synthesis" (13) sections of SOUV and from social media platforms (29). Each sentence was given three positive classes, the top-1 and top-3 criteria predictions from the BERT and ULMFiT models, and one negative class, another random cultural criterion. Not knowing that the criteria were predictions by computer models, the experts were asked to rate the relevance of the sentences to each criterion on a 5-point Likert scale.
The distributions of all the ratings are shown in Figure 3. For all data sources, the expert ratings for top-1 and top-3 predictions are significantly higher than those for negative classes based on Mann-Whitney U tests (see Table 8 in Appendix E). The average expert ratings for each sentence-criterion pair show a strong correlation with the average confidence scores of the models (r = 0.618, p < 0.001). Some heritage experts seem to be rather cautious and reserved about assessing informal texts as "culturally significant" without further historical context and comparative studies. For example, the third sentence in Table 9 of Appendix E from social media, "In 1952, the station was finalized on a design by the architect Paul Perilli", with a predicted label of criterion (i), got extremely divergent expert scores. For some experts, it is clearly related to criterion (i) about masterpieces based on the semantic content. However, for the experts who rated it low, merely declaring that some building was designed by a certain architect does not automatically entail that it is a masterpiece; further investigation would be needed to fully convince them. Although such an example shows disagreement amongst the experts and between the experts and the computer models, it does not limit the machine's ability to differentiate positive and negative classes. Full details of the human study are presented in Appendix E. The expert evaluation shows that the models are sufficiently reliable and capable of identifying OUV-related statements even in the less formal social media data, which is useful for the ultimate motivations of this study discussed in Section 1.

Discussion and Conclusions
This paper presents a new text classification benchmark derived from a real-world problem concerning UNESCO World Heritage Statements of Outstanding Universal Value (OUV). The problem is essentially a multi-class single-label classification task, while the classes are not necessarily mutually exclusive. The prior knowledge of the class associations is added to the training process as soft labels through novel variants of label smoothing (LS). The study shows that introducing LS improved the performance of most baselines, reaching a top-3 accuracy of 94.3%. The models also performed reasonably well on an independent test dataset and received positive outcomes in a human study with domain experts, suggesting that the classifiers have the potential to be further developed and applied in World Heritage research and practice.
LS was not tuned together with the other hyperparameters during training, yet it still showed an improvement on most baselines. However, the complex effect of LS on different baselines needs more investigation. The top-1 accuracy is limited even for the best models, which is not uncommon in the literature for non-binary multi-class classification when the labels are not sufficiently distinct (Sun et al., 2019). Applying data augmentation and training supplemental binary classifiers may improve the performance on difficult classes. The choice of replacing all numbers with <NUM> tokens might introduce both advantages and drawbacks in terms of semantic context and generalizability when historical dates are crucial information, which invites more investigation. Moreover, more studies on the generalizability and reliability of the models on data from different distributions (e.g., from policy documents or news articles) are needed before further application. This work supports a series of follow-up studies respectively exploring the intrinsic associations of OUV based on the models' behaviour (Bai et al., 2021a), the application of the proposed methods in social media mining in Venice (Bai et al., 2021b), and the generalizability in case studies worldwide.
This work is intended to aid, not replace, the work of human stakeholders: State Parties identifying OUV-related statements in documentation, Advisory Bodies and the WHC reviewing and revising yearly nomination proposals, researchers investigating massive official discourse and user-generated content, and the public seeking to understand the values of the World Heritage around them. WHOSe Heritage can therefore be another milestone in the digital transformation of World Heritage studies, aiming at a more socially inclusive future practice.

Broader Impact Statement
This work focuses on exploring and applying NLP techniques in a real-world application of cultural and natural World Heritage (WH) preservation for the sake of social good. The research aims to aid the identification and justification of heritage values across the world for various stakeholders, including both heritage experts and lay-persons, through text classification, as pointed out in Sections 1 and 6. It can lead to a better understanding of the OUV criteria and the associations among them.
The dataset used in this work was collected by the author(s) from the public website of the UNESCO World Heritage Centre via XLS syndication, respecting the terms of use and copyright. The dataset is fully described in Section 3.1 and Appendix A. All labels are based on the official OUV justifications given by local and global heritage experts and involve no crowd workers or other new annotators. The dataset and the methods used in the paper contain no demographic or identity characteristics. Once deployed, the model does not learn from user inputs, and it generates no harmful output to users. The expert evaluation involving a human study was entirely voluntary, collected no personal information, and fully protected the privacy of the experts. Though initially kept unaware of the true purpose of the evaluation to reduce bias, the experts were debriefed about the study afterwards.
BERT and ULMFiT with LS performed best on all investigated metrics. However, there is a trade-off to consider for real-world application. As shown in Appendix D and Section 5.2, ULMFiT has a relatively shorter inference time than BERT, while BERT is potentially more explainable due to its attention mechanism. Each model might thus be optimal for different application scenarios.
Nevertheless, the interpretation of the classification results needs to be conducted carefully by researchers and practitioners, especially during policy decision-making on World Heritage for the benefit of all humanity. WH inscription and OUV justification are far more complicated than merely reading written texts and identifying the described values; rather, they constitute a systematic thematic study based on scientific research, always rooted in a comparative study across the globe (Jokilehto, 2008). The actual decisions to include new nominations in the WHL have to be made by humans conducting heritage investigations. This is also evident in the results of the expert evaluation and in the open discussion about the exercise with the invited experts. As the example in Section 5.2 shows, thorough heritage investigations are always needed to determine whether a site truly justifies certain OUV selection criteria. Such investigations, however, are beyond the scope of our NLP study of the semantic and syntactic content of written official documents. Therefore, a human has to be involved in the loop during application.
This study and the obtained NLP models are inherently less biased than manual annotation by a single expert, in the sense that they avoid injecting too much implicit personal experience into the written texts, and that the trained models represent the collective views of many human experts in the past. This can also be seen in some divergent evaluation outcomes among the eight invited experts, as demonstrated in Appendix E: though one specific expert may be more cautious and critical about a certain sample, the overall trend across all experts consistently differentiates the positive and negative classes. However, computational models trained on SOUV can also be a double-edged sword, as they are highly dependent on the existing descriptions, which may contain historical unfairness.
Researchers and practitioners, especially those outside the Computer Science field, need to be explicitly informed, and even warned, before use about the limitations of such models, to avoid automation bias, the tendency of people to favour results automatically generated by systems for decision-making (Parasuraman and Manzey, 2010). Wrongly under-judging the value of a WH nomination merely on the basis of text classification results, and consequently deferring or even refusing the inscription, could in the worst scenario cause a great loss to human culture, as it can hamper access to available heritage management and conservation programmes. Therefore, this work functions as a supplemental tool and reference for understanding and evaluating the World Heritage OUV implied in text descriptions; it will not and shall not replace human effort or deviate from expert knowledge in the WH decision-making process. Instead, it has two ultimate goals as use-cases: 1) aiding inscription processes by checking the coherence and/or consistency of OUV statements; 2) mining heritage-value-related texts from multiple data sources (e.g., social media).

Bio-diversity: To contain the most important and significant natural habitats for in-situ conservation of biological diversity, including those containing threatened species of outstanding universal value from the point of view of science or conservation. (156 sites)
Table 4: The definition of each UNESCO World Heritage OUV selection criterion and its main topic according to UNESCO (2008), Jokilehto (2008), and Bai et al. (2021a). The last column shows the total number of times each criterion has been justified for a WH site, either uniquely or together with other criteria, up to 2019.

Table 5: The distribution of the total number of selection criteria Σ_{k=1}^κ γ_{i,k} a site is justified with.

itself 7. For example, cultural (criteria i-vi, also denoted C1-C6) and natural (criteria vii-x, also denoted N7-N10) OUV used to be justified separately as two sets. Since 2004, the two sets have been combined. Although WH sites are usually justified with OUV from one category (cultural or natural), within the domain of mixed heritage and cultural landscapes, OUV from both categories can co-occur at one site (e.g., Mount Tai is justified with all of the first seven criteria).
Association between Criteria Among all 1121 sites inscribed in the World Heritage List up to 2019, only 188 are justified with a single criterion. The distribution of the total number of criteria justified for each site (i.e., Σ_{k=1}^κ γ_{i,k}) is shown in Table 5. This is an indication of the extent to which the problem has a multi-label classification nature. It is also the rationale behind the choice of k = 3 for the evaluation metrics Top-k Accuracy and Top-k Match, as 85.5% of sites are justified with no more than 3 criteria. Regardless of the number of co-justified criteria for each site, the co-occurrence matrix A_{κ×κ} of all selection criteria is shown in Figure 4. The row-normalized A_{κ×κ} becomes the source of the criterion-specific non-negative vectors μ_k of the prior variant of Label Smoothing (LS), as discussed in Section 4.1.

7 http://whc.unesco.org/en/criteria/
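The construction of the prior soft labels from the row-normalized co-occurrence matrix can be sketched as follows. The counts below are toy values (the real A_{κ×κ} comes from Figure 4), and the linear blend with weight alpha is an illustrative mixing scheme; the exact formulation used in the paper is the one given in Section 4.1:

```python
import numpy as np

# Toy co-occurrence counts among 4 criteria (the paper uses kappa = 10).
# A[i, j] counts sites justified with both criteria i and j.
A = np.array([[30., 12.,  5.,  1.],
              [12., 45., 20.,  3.],
              [ 5., 20., 25.,  2.],
              [ 1.,  3.,  2., 10.]])

# Row-normalize so that each mu_k sums to 1: the prior distribution for class k.
mu = A / A.sum(axis=1, keepdims=True)

def prior_soft_label(k: int, alpha: float = 0.1) -> np.ndarray:
    """Blend the one-hot target for class k with its prior vector mu_k."""
    one_hot = np.eye(A.shape[0])[k]
    return (1 - alpha) * one_hot + alpha * mu[k]

y = prior_soft_label(1)
assert abs(y.sum() - 1.0) < 1e-9  # still a valid probability distribution
```

The soft target keeps most mass on the true class while redistributing some toward the criteria it frequently co-occurs with.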
Criteria from the same category are co-justified more often, while (ii, iv), (iii, iv), and (ii, iii) are the most frequently co-occurring pairs.
Dataset Example A data point concerning the WH site "Kalwaria Zebrzydowska: the Mannerist Architectural and Park Landscape Complex and Pilgrimage Park" in Poland, justified with Criteria (ii) and (iv), is shown in Table 6, with the attributes: text data x_{i,j,k}; sentence label as discrete index k; sentence label as one-hot vector y_{i,j,k} (appended with a 0 at the end for the class "Others"); parental label as vector γ_i (appended with 0.2); sample length |x_{i,j,k}| in number of tokens; index of the parental WH site i; and the data split.

B Proof of the Equivalence
Here we show that the Vanilla Label Smoothing (LS) defined in Equations 4 and 5 is equivalent to the original LS, which assigns uniform noise to all classes.
Proof. The LS defined in Szegedy et al. (2016) can be rewritten as follows to fit the mathematical notation of this paper:

ỹ_{i,j,k} = (1 − ε) y_{i,j,k} + (ε/K) 1,    (8)

where y_{i,j,k} is the one-hot vector of the "ground-truth" label, K is the total number of classes (instead of κ + 1 as in the main text, for brevity and generality), ε is the smoothing parameter as a scalar, and 1 is a vector of 1s of size K × 1.

On the other hand, the Vanilla LS proposed in this paper can be written as:

ỹ_{i,j,k} = (e^{y_{i,j,k} + α1} − 1) / ((e^{y_{i,j,k} + α1})^T 1 − K),    (9)

where the exponential is applied entry-wise. We will show that, for a suitable ε, the vectors in Equations 8 and 9 are the same. First, it is trivial that both vectors have the same shape as y_{i,j,k}, i.e., K × 1, and that the entries of each vector sum to 1; e.g., observe that the denominator of the right-hand side of Equation 9 equals the sum of the entries of the numerator.

Second, we assume, without loss of generality, that the "ground-truth" entry of the one-hot vector y_{i,j,k} is its first entry, i.e., y_{i,j,k} = [1, 0, ..., 0] of size K × 1. Let

ε := K (e^α − 1) / S,    (10)

where S := e^{1+α} + (K − 1) e^α − K. Then the two vectors can be rewritten entry-wise as:

(1 − ε) y_{i,j,k} + (ε/K) 1 = [1 − ε + ε/K, ε/K, ..., ε/K],    (11)

(e^{y_{i,j,k} + α1} − 1) / ((e^{y_{i,j,k} + α1})^T 1 − K) = [(e^{1+α} − 1)/S, (e^α − 1)/S, ..., (e^α − 1)/S].    (12)

Substituting Equation 10 into the entries of Equation 11, the first entry can be rewritten as 1 − (K − 1)(e^α − 1)/S = (e^{1+α} − 1)/S, and the other entries as ε/K = (e^α − 1)/S. Both types of entries are exactly those shown in Equation 12.

Last, we show that ε has a one-to-one relation with α based on Equation 10 when α ≥ 0. The partial derivative of ε with respect to α, ∂ε/∂α = K(e − 1)e^α / S², is non-negative, so the function is monotonic. Furthermore, ε = 0 when α = 0, and ε → K/(e − 1 + K) > 0 when α → +∞, so ε is increasing in α. This means that a unique ε ∈ [0, K/(e − 1 + K)) exists for every non-negative α, and vice versa.
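The equivalence can also be verified numerically. The snippet below evaluates the Vanilla LS expression of Equation 9 and the Szegedy-style LS of Equation 8 with the ε implied by Equation 10, for an arbitrary α:

```python
import numpy as np

K, alpha = 11, 0.5
y = np.zeros(K)
y[0] = 1.0  # one-hot "ground-truth" at the first entry, w.l.o.g.

# Vanilla LS (Equation 9): (e^{y + alpha*1} - 1), normalized by its own sum.
num = np.exp(y + alpha) - 1.0
vanilla = num / num.sum()

# Szegedy-style LS (Equation 8) with the epsilon from Equation 10.
S = np.exp(1 + alpha) + (K - 1) * np.exp(alpha) - K
eps = K * (np.exp(alpha) - 1.0) / S
szegedy = (1 - eps) * y + eps / K

assert np.allclose(vanilla, szegedy)  # the two soft-label vectors coincide
```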

C Model Implementation Detail
For all baselines, Adam (Kingma and Ba, 2015) is used as the optimizer with L2 regularization. Unless explicitly mentioned below, hyperparameter tuning is conducted as a grid search within a small range for each hyperparameter (and/or selected from common experience if not mentioned), based on the top-k accuracy on the validation split with an early-stopping criterion of 5 epochs. The models are implemented in PyTorch (Rao and McMahan, 2019), and experiments are performed on an NVIDIA Tesla P100 GPU (N-gram, GRU+Attn, BERT) and an Intel Core i7-8850H CPU (BoE, ULMFiT), respectively.

N-gram
The N-gram model uses the TfidfVectorizer from the Scikit-learn Python library to obtain an embedding vector of all 1-grams and 2-grams that appear at least twice in the corpus. The embedding vectors are then fed into a 2-layer Multi-Layer Perceptron (MLP) to obtain the model prediction.
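The N-gram baseline can be sketched with scikit-learn as below. The texts and labels are toy placeholders (the real model trains on the SOUV dataset with 11 classes), and the MLP width is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for SOUV sentences; labels are criterion indices.
texts = ["a masterpiece of human creative genius",
         "an outstanding example of a traditional settlement",
         "superlative natural phenomena and areas of exceptional beauty",
         "an important interchange of human values in architecture"]
labels = [0, 3, 6, 1]

clf = make_pipeline(
    # 1-grams and 2-grams; the paper keeps n-grams with min_df=2 on the
    # full dataset, relaxed to 1 here so the toy corpus retains features.
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
clf.fit(texts, labels)
pred = clf.predict(["a creative masterpiece of genius"])
```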

BoE
The Bag-of-Embeddings (BoE) model uses the GloVe-6B-300d vectors8 as initial embeddings, which are tunable during training. Only words with a frequency above a threshold in the full dataset are kept, while the others are transformed into a special < UNK > token. The word embeddings of all words in a sentence are averaged before being fed to a 2-layer MLP.
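The BoE architecture amounts to mean-pooling over trainable embeddings followed by an MLP. A minimal PyTorch sketch, with a randomly initialized embedding standing in for the GloVe-6B-300d initialization and illustrative layer sizes:

```python
import torch
import torch.nn as nn

class BoEClassifier(nn.Module):
    """Average tunable word embeddings, then a 2-layer MLP."""
    def __init__(self, vocab_size: int, emb_dim: int = 300,
                 hidden: int = 128, n_classes: int = 11):
        super().__init__()
        # In the paper this is initialized from GloVe-6B-300d and kept
        # trainable; randomly initialized here for a self-contained example.
        self.emb = nn.Embedding(vocab_size, emb_dim)  # index 0 for <UNK>
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); rare words map to <UNK> upstream.
        return self.mlp(self.emb(token_ids).mean(dim=1))

model = BoEClassifier(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 12)))
assert logits.shape == (2, 11)
```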
GRU+Attn The GRU+Attn model also uses the GloVe-6B-300d embeddings, which are frozen during training. The embedding sequence is fed into a GRU network. Word-level attention (Yang et al., 2016) is applied to compute the sentence vector from a learned word context vector and the last hidden state of the GRU. The sentence vector is fed to a 1-layer feed-forward network for the model output.
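A sketch of this baseline, following the word-level attention formulation of Yang et al. (2016): a learned context vector scores the GRU hidden states, and the attention-weighted sum forms the sentence vector. Layer sizes are illustrative, and the embedding is randomly initialized in place of frozen GloVe vectors:

```python
import torch
import torch.nn as nn

class GRUAttn(nn.Module):
    """Frozen embeddings -> GRU -> word-level attention -> linear head."""
    def __init__(self, vocab_size=1000, emb_dim=300, hidden=128, n_classes=11):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.emb.weight.requires_grad = False       # GloVe vectors are frozen
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, hidden)
        self.context = nn.Parameter(torch.randn(hidden))  # word context vector
        self.head = nn.Linear(hidden, n_classes)    # 1-layer feed-forward

    def forward(self, token_ids):
        h, _ = self.gru(self.emb(token_ids))        # (batch, seq, hidden)
        u = torch.tanh(self.attn(h))                # word representations
        a = torch.softmax(u @ self.context, dim=1)  # attention weights
        s = (a.unsqueeze(-1) * h).sum(dim=1)        # sentence vector
        return self.head(s), a                      # logits + weights

model = GRUAttn()
logits, weights = model(torch.randint(0, 1000, (2, 15)))
```

The returned attention weights are what the expert-evaluation survey uses to highlight "important" words in bold.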
ULMFiT The ULMFiT model employs the idea of Universal Language Model Fine-tuning from a general-domain language model pretrained on Wikitext-103 with the AWD-LSTM architecture (Howard and Ruder, 2018). A domain-specific language model is then fine-tuned on the full UNESCO WHL dataset, including SD, using the fastai API (Howard and Gugger, 2020). One epoch is trained with a learning rate of 1e-2 with only the last layer unfrozen, reaching a perplexity of 46.71. Then the entire model is unfrozen and trained for 10 further epochs with a learning rate of 1e-3, yielding a fine-tuned WH domain-specific language model with a perplexity of 30.78. Some samples from the language model at this step are shown below, starting from the given phrases marked in bold:

- This site is unique because it is the only example of a complex of karst complexes that is clearly recognised as being of outstanding universal value.
- The island of zanzibar has been inscribed as a world heritage site in <num>. The inscriptions, which bear witness to the civilisation of...
- This architecture has a special layout, especially in the form of the body of the building.
- The planet's primary feature is the addition of the ideal island, which lies at an elevation of <num> m above the sea floor, and is home to some <num>...

The encoder of the fine-tuned language model is loaded in PyTorch, followed by a Pooling Linear Classifier 9 for classifier fine-tuning. Gradual unfreezing is applied in a simplified manner to prevent catastrophic forgetting: 1) for the 1st epoch, only the decoder is unfrozen and trained with a learning rate of 2e-2; 2) for the 2nd to 4th epochs, one more layer is unfrozen each time and trained with a learning rate of 1e-2, 1e-3, and 1e-4, respectively; 3) from the 5th epoch onward, the full model is unfrozen and trained with a learning rate of 2e-5. An early-stopping criterion of 3 epochs is applied.
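The simplified gradual-unfreezing schedule above can be sketched in plain PyTorch, independently of the fastai implementation. The helper functions below are illustrative names, not part of any library, and the toy stack of linear layers stands in for the AWD-LSTM encoder plus classifier head:

```python
import torch.nn as nn

def unfreeze_schedule(layers, epoch):
    """Return (number of trainable layers from the top, learning rate)
    for the simplified schedule described in Appendix C."""
    if epoch == 1:
        return 1, 2e-2                               # decoder only
    if 2 <= epoch <= 4:
        return epoch, [1e-2, 1e-3, 1e-4][epoch - 2]  # one more layer per epoch
    return len(layers), 2e-5                         # full model from epoch 5

def apply_schedule(layers, epoch):
    """Freeze/unfreeze layers (ordered bottom -> top) and return the lr."""
    n_open, lr = unfreeze_schedule(layers, epoch)
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - n_open
        for p in layer.parameters():
            p.requires_grad = trainable
    return lr

layers = [nn.Linear(8, 8) for _ in range(5)]  # toy stand-in for the model
lr = apply_schedule(layers, epoch=1)
assert lr == 2e-2
assert all(not p.requires_grad for p in layers[0].parameters())  # bottom frozen
```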

Performance
No extensive hyperparameter tuning is performed since: 1) tuning ULMFiT is expensive on a CPU; 2) the hyperparameter configuration suggested from experience by Howard and Gugger (2020) and Howard and Ruder (2018) already performs reasonably well; 3) the purpose of this study is not necessarily to find the best hyperparameters. The final model uses a batch size of 64, L2 regularization of 1e-5, and the default dropout rates for the decoder.
BERT The BERT model uses the uncased base model from the Transformers library (Wolf et al., 2020). The pooler output, processed from the last hidden state of the [CLS] token during pretraining, is fed into a 1-layer feed-forward network to fine-tune the classifier (Sun et al., 2019). An early-stopping criterion of 10 epochs is applied.

D Extended Model Performance
Resource and Time Table 7 shows further information on model performance in terms of training resource utilization, model size, and inference time. Training is conducted on a GPU or CPU, respectively, while inference is fully conducted on a CPU.
It can be noted that the best-performing models, ULMFiT and BERT, also consume the most resources in terms of training time and infrastructure usage, and have the largest model sizes. Though the most time-consuming to train, ULMFiT takes remarkably little time for inference on a CPU compared to BERT. This suggests that ULMFiT might be the optimal choice for further development and application when time is critical.

Confusion Matrices
The confusion matrices of the best-performing ULMFiT and BERT models on the test split are shown in Figure 5. Certain criteria are easily confused with others: sentences with a "ground-truth" label of criterion (iv) can be confused with criteria (i), (ii), and (iii), and vice versa, while criterion (iii) may easily be confused with criterion (vi), but not vice versa. This complex association relationship is discussed extensively in Bai et al. (2021a).

E Expert Evaluation Details
Materials The materials about the WH site "Venice and Its Lagoon" for the expert evaluation were harvested from three data sources: 1) all 14 sentences from the Justification for Criteria section of the Statement of OUV (SOUV), where each sentence has one "ground-truth" sentence label and the parental site label of Venice, and which is also within the data X_i used during model training and testing; 2) all 13 sentences from the Brief Synthesis section of the SOUV, where sentences only share the same multi-label parental label of Venice, similar to the SD test data S_i used for the generalization test; 3) Social Media data sampled from a total of 1687 social media posts with a written textual description, collected from Flickr in the region of Venice at a resolution of 5 km using the Flickr API10. Among the 1687 posts, there are 820 unique textual descriptions in English. By splitting the unique posts into sentences, removing HTML symbols, and filtering out texts about camera parameters, image formats, and advertisements, 1132 sentences were obtained. These were fed into the trained BERT and ULMFiT models and further filtered based on the predictions: 1) the total confidence score of the top-3 predictions had to be larger than 0.8 for both models; 2) the Intersection over Union of the top-3 predictions of the two models had to be larger than 0.5 (i.e., at most one different predicted class). As a result, 388 Social Media sentences potentially conveying OUV-related information were obtained, from which 29 sentences were randomly sampled for the expert evaluation.
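The two-model agreement filter can be sketched as below. The function name and the >= comparison for the overlap case are assumptions: the paper phrases the IoU condition as "larger than 0.5 (i.e., maximum one different predicted class)", and a 2-of-3 overlap yields an IoU of exactly 2/4 = 0.5, so the inclusive comparison matches the stated intent:

```python
def keep_sentence(bert_top3, ulmfit_top3, bert_conf, ulmfit_conf,
                  conf_thresh=0.8, iou_thresh=0.5):
    """Keep a sentence only if both models are confident AND their
    top-3 predicted classes differ by at most one class."""
    # Filter 1: total confidence of the top-3 predictions per model.
    if sum(bert_conf) <= conf_thresh or sum(ulmfit_conf) <= conf_thresh:
        return False
    # Filter 2: Intersection over Union of the two top-3 prediction sets.
    b, u = set(bert_top3), set(ulmfit_top3)
    return len(b & u) / len(b | u) >= iou_thresh

# Two of three classes shared, both models confident -> kept.
kept = keep_sentence(["i", "ii", "iv"], ["i", "ii", "vi"],
                     [0.5, 0.3, 0.1], [0.6, 0.2, 0.1])
assert kept is True
```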
Survey Design Each of the 56 sentences was fed into the BERT and ULMFiT models to obtain predictions and confidence scores. The selection criterion predicted with the highest confidence score by both models was considered the top-1 prediction. Two other criteria within the top-3 classes predicted by both models with relatively high confidence scores were considered the top-3 predictions for the survey. Another random cultural criterion not predicted by either model to be among the top-3 classes was taken as the negative class for each sentence. Criteria for natural heritage were not sampled as negative classes, as they are not easily confused with the positive cultural ones. As a result, each sentence received four criteria to be evaluated. All four criteria were presented in a random order for each sentence, asking for an evaluation of the relevance of the sentence to the criterion on a 5-point Likert scale (from "5: makes much sense" to "1: makes no sense"). The "important" words with higher attention weights in the GRU+Attn model were highlighted in bold. An example of such an evaluation on the Qualtrics platform is shown in Figure 6. The sentences from the three data sources were grouped into four separate sessions, with the social media data split into two sessions. The session on "justification for criteria" was always presented first during evaluation, also serving as practice for the experts. The other three sessions were presented in a randomized order to prevent systematic errors caused by impatience or tiredness. Additional questions about familiarity with heritage value identification, familiarity with Venice, confidence in the evaluation, usefulness of the highlighted words, and overall enjoyment and difficulty of the exercise were raised before and after the evaluation, respectively, also on 5-point Likert scales. Note that the number of samples involved in the in-depth expert evaluation is relatively small, which is not uncommon in qualitative validation. Moreover, we plan to conduct an online non-expert human evaluation in follow-up studies, which could involve more participants and a larger sample of sentences. It would, however, serve a different purpose than the expert evaluation presented here.

General Analyses
The evaluations took 55.10 ± 20.74 minutes to finish. The eight experts are all

Figure 1 :
Figure 1: The average training curve of the best-performing models in experiments under 10 random seeds for each baseline on the validation split. The x-axes show several epochs before early stopping happened; the numbers of epochs differ for each baseline, as described in Appendix C. Orange curves with triangles show the top-k (k=3) accuracy with uniform LS, red curves with crosses with prior LS, green curves with circles with vanilla LS, and blue curves with stars without LS. 95% confidence intervals of the performance based on the 10 random seeds are shown as shaded areas.

Figure 2 :
Figure 2: The overall and fine-grained top-3 predictions of models, and attention weights of the GRU+Attn and BERT models on the exemplary sub-sentences concerning criterion (i) in Venice. The left part of the image reports the top-3 predictions of all 5 models when the models take the aggregated paragraph as input. The top part reports the fine-grained top-3 predictions of two models on each sub-sentence. The rest of the image visualizes the attention weights: those of GRU+Attn in grey-scale, and those of BERT as coloured bars using BertViz.

Figure 3 :
Figure 3: The distribution, as violin plots, of expert evaluations of the relevance of selection criteria to sample sentences about Venice from three sources. The scores for the top-1, top-3, and negative classes predicted by the models are plotted separately. The 25% and 75% percentiles and the medians are also shown.

Figure 5 :
Figure 5: The confusion matrices of ULMFiT and BERT on test split.

Table 2 :
The performance of models with and without LS on the validation split, test split (top-1 accuracy, top-k accuracy, and averaged macro F1), and the independent SD test set (top-1 match and top-k match), where k=3. The best score for each metric is highlighted in bold, and underlined if it occurs in a model with LS. The effect of adding LS to each baseline is marked with background colors: blue indicates a rise in performance, red a drop, and grey a tie; a darker background color indicates a larger variation in performance.

Table 3 :
Per-class metrics over all models on the validation and test splits with LS, and the main focus of each criterion adapted from Jokilehto (2008).

Table 7 :
The model performance in terms of resource occupancy and inference time. Inference is conducted on an Intel Core i7-8850H CPU. "Inference time per item" shows the average time the model takes to predict one sentence, and "Inference time for SD" shows the total time the model needs to process and predict the full independent Short Description (SD) test set.