BOUN at SemEval-2021 Task 9: Text Augmentation Techniques for Fact Verification in Tabular Data

In this paper, we present our text augmentation based approach for the Table Statement Support Subtask (Phase A) of SemEval-2021 Task 9. We experiment with different text augmentation techniques such as back translation and synonym swapping using Word2Vec and WordNet. We show that text augmentation techniques lead to a 2.5% improvement in F1 on the test set. Further, we investigate the impact of domain adaptation and joint learning on fact verification in tabular data by utilizing the SemTabFacts and TabFact datasets. We observe that joint learning improves the F1 scores on the SemTabFacts and TabFact test sets by 3.31% and 0.77%, respectively.


Introduction
Recognizing Textual Entailment (RTE) (Dagan et al., 2005) is one of the core NLP problems for understanding the semantic relations between words and sentences, which is useful for other tasks including Question Answering (Abacha and Demner-Fushman, 2019), Text Summarization (Lloret et al., 2008), and Text Classification (Yin et al., 2019). For the RTE task, datasets of various sizes (Dagan et al., 2005; Bowman et al., 2015) and from different domains (Romanov and Shivade, 2018) have been introduced. However, these works and datasets are solely focused on textual data without considering structured data such as tables.
Recently, question answering (Iyyer et al., 2017) and textual entailment datasets (Wenhu Chen and Wang, 2020; Wang et al., 2021) for tabular data have been introduced.

Figure 1: Sample table, description, and statements from SemTabFacts. The table reports distance statistics between ancient buildings and modern buildings to the main water channel (unit: meter). Example statements and labels: "There are 2 types of building - Ancient building and Modern building." (Entailed); "All the values of Ancient building is less than Modern building except MIN value." (Entailed); "The value of Modern building is is lesser than Ancient building in AVERAGE." (Refuted; the grammatical error exists in the given dataset).

SemEval-2021 Task 9 addresses the problem of statement verification
(Phase A) and evidence finding (Phase B) using tables from scientific articles (Wang et al., 2021). The shared task also introduced a new dataset, namely the SemTabFacts dataset, an example from which is provided in Figure 1. The goal of Phase A (Table Statement Verification) of the shared task is to determine whether a statement is entailed, refuted, or unknown given a table and its description (if available). For example, given the table and its description in Figure 1, the first two statements are entailed, whereas the third statement is refuted. This example demonstrates that the task involves various challenges, such as understanding numerical operations and comparisons as well as textual entailment. The Transformer architecture (Vaswani et al., 2017) enabled the pretraining of large language models, which achieve significant improvements in numerous NLP tasks (Wang et al., 2018, 2019). Recent works have also focused on pretraining language models for tabular data by introducing new embedding layers and objective functions, as well as large-scale augmented data, to better represent numerical values and rankings (Herzig et al., 2020). Data augmentation is a way to enrich training data to improve supervised training and is widely used in computer vision (Perez and Wang, 2017) and speech recognition (Park et al., 2019). Different text augmentation techniques, such as back translation, synonym replacement, and text editing, have been investigated for various tasks including text classification (Wei and Zou, 2019) and natural language inference (Min et al., 2020).
In this study, we investigate the impact of text augmentation on the statement verification task over tables. We implement various text augmentation techniques based on WordNet (Miller, 1998), Word2Vec (Mikolov et al., 2013), and back translation (Yu et al., 2018) to enrich the statement variety in the SemTabFacts dataset. We finetune a recently introduced pretrained transformer architecture for tabular data, the TAPAS model, for our approach. In addition, we investigate the domain adaptation and joint learning capabilities of two tabular fact verification datasets: SemTabFacts and TabFact. Promising results are achieved on the SemTabFacts test set.

Datasets
We use two different table-based fact verification datasets in our experiments: SemTabFacts (Wang et al., 2021) and TabFact (Wenhu Chen and Wang, 2020). We compare SemTabFacts and TabFact in terms of the average size of the tables, the average word length of the statements, and the number of examples per class in Table 1. We only report statistics for the training sets, since the development and test sets have distributions similar to the training sets in both datasets. There is almost an order of magnitude difference between the datasets in the number of tables and statements. Furthermore, we observe that the average table size and the average statement length in words are greater in TabFact than in SemTabFacts.
SemTabFacts (Wang et al., 2021): This dataset consists of tables from articles published in Elsevier, which are available on ScienceDirect. After filtering complicated examples, five entailed and five refuted statements about these tables are generated by high-quality crowd-sourcing. These statements are further verified by additional crowdsource workers, especially for filtering out ungrammatical sentences. To increase the quality level, Wang et al. (2021) further verified the statements in the development and test sets. The SemTabFacts dataset also contains automatically generated statements and unknown classes in the development and test sets for the fact verification and evidence finding tasks. In this study, we target two-way (Entailed / Refuted) classification without automatically generated statements for the fact verification task.
SemTabFacts releases tables and statements in XML format. We convert these tables into CSV format for use in our models. Due to cells with multirow and multicolumn features in the XML, we could not accurately convert all tables into CSV, which might affect our models' overall performance. We manually checked the XML to CSV conversion of 50 tables and identified three errors related to multirow and multicolumn features, and one error that causes a missing column.
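As a rough illustration, the conversion can be sketched as follows. The element names (table, row, cell) and the colspan attribute are simplifying assumptions for illustration, not the exact SemTabFacts schema; multirow cells, which caused some of our conversion errors, are not handled here.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_table_to_csv(xml_string):
    """Convert a simplified XML table into CSV text.

    Assumes <table><row><cell> markup; a cell's colspan is handled by
    repeating its value so that column widths stay aligned.
    """
    root = ET.fromstring(xml_string)
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in root.iter("row"):
        cells = []
        for cell in row.iter("cell"):
            text = (cell.text or "").strip()
            span = int(cell.get("colspan", "1"))
            cells.extend([text] * span)
        writer.writerow(cells)
    return buf.getvalue()
```

For example, a header cell spanning two columns is duplicated into both CSV columns, which keeps the row widths consistent but loses the original span information.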
TabFact (Wenhu Chen and Wang, 2020): This dataset contains tables crawled from Wikipedia articles, following previous work on table question answering (Pasupat and Liang, 2015; Zhong et al., 2017). Complicated tables containing multirow or multicolumn cells or LaTeX symbols, and large tables with more than 50 rows or 10 columns, were filtered out. Amazon Mechanical Turk was used to generate simple and complex statements about the tables. The Mechanical Turk workers also filtered out poor statements with grammatical errors or vague claims. Finally, annotator agreement scores were computed by having the same set of statements labeled by another set of Mechanical Turk workers.

TAPAS
Deep transformer models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have achieved significant improvements in different NLP tasks, as seen in the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks. However, it is not straightforward to benefit from these models for structured data formats such as tables or graphs. TAPAS (Herzig et al., 2020) introduces different objectives, such as cell selection and aggregation prediction, and additional embeddings, such as column/row ids and rank ids, on top of BERT's architecture, which are more suitable for complex numerical operations and comparisons in tables. The TAPAS model was designed with a focus on question answering over tables (Herzig et al., 2020). However, TAPAS fails to handle complex compositional structures such as multiple aggregations, as well as large tables, due to the maximum length limit of the tokenizer.
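To illustrate the intuition behind the rank-id embedding, the sketch below assigns 1-based ranks to a numeric column (1 = smallest). This is only a sketch of the idea; TAPAS's actual ranking scheme (tie handling, direction, per-column computation) may differ.

```python
def rank_ids(values):
    """Assign 1-based ranks to a numeric column, smallest value first.

    Giving the model explicit ranks lets it answer comparison questions
    ("lowest", "second highest") without doing arithmetic over raw values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks
```

For a column of distances such as [10.0, 2.5, 7.0], the ranks [3, 1, 2] directly encode which cell holds the minimum, which is the kind of signal comparison statements in Figure 1 rely on.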
To overcome the problems in (Herzig et al., 2020), follow-up work recently introduced new mechanisms such as table pruning to make TAPAS work with large tables without memory errors. Furthermore, two augmentation methods for statements were presented in that work. The first creates counterfactual statements by replacing entity mentions with other entities in entailed examples to populate negative samples. The second is a synthetic data generation method that populates statements with complex numerical operations.
In this study, we use a TAPAS model from HuggingFace's Transformers library (Wolf et al., 2019). This model is pretrained with a masked language modeling objective and additional intermediate pretraining steps, and it is finetuned on the TabFact dataset (Wenhu Chen and Wang, 2020). We further finetune this model on SemTabFacts (Wang et al., 2021) with additional augmentation steps utilizing WordNet, Word2Vec, and back translation.

WordNet
WordNet (Miller, 1998) is a lexical database that groups words into adverbs, adjectives, nouns, and verbs, and captures relations between them such as hyponymy, antonymy, and synonymy. In this work, we focus on swap-based WordNet augmentation, which replaces words with their WordNet synonyms. The implementation uses the TextAttack (Morris et al., 2020) library. As shown in Figure 2, the word lowest is changed to small by synonym swapping.
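The swap can be sketched as below. The SYNONYMS table is a toy stand-in for WordNet lookups, and the deterministic first-synonym choice is a simplification; in our pipeline, TextAttack's WordNet augmenter performs the lookup and sampling.

```python
# Toy synonym table standing in for WordNet (entries chosen to mirror
# the "lowest" -> "small" example from Figure 2).
SYNONYMS = {
    "lowest": ["small", "least"],
    "greater": ["larger"],
}

def swap_synonyms(statement, synonyms=SYNONYMS):
    """Replace each word that has a synonym entry with its first synonym."""
    out = []
    for token in statement.split():
        candidates = synonyms.get(token.lower())
        out.append(candidates[0] if candidates else token)
    return " ".join(out)
```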

Word2Vec
Word2Vec (Mikolov et al., 2013) is a technique for learning dense word embeddings with shallow networks. It represents syntactic and semantic features of words as dense vectors. In this low-dimensional space, similar words and synonyms have close embeddings. We make use of this property to replace words with their Word2Vec neighbors using the TextAttack library. For example, this augmentation technique changes the word compare to comparisons, as shown in Figure 2. While WordNet augmentation preserves the part-of-speech tags of the words, Word2Vec augmentation may distort them and produce ungrammatical sentences.
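The replacement step amounts to a nearest-neighbor search by cosine similarity. The tiny 2-d embedding table below is made up for illustration (mirroring the compare -> comparisons example); the real augmentation uses pretrained Word2Vec vectors through TextAttack.

```python
import math

# Hypothetical 2-d embeddings; real Word2Vec vectors are ~300-d.
EMBEDDINGS = {
    "compare": (1.0, 0.0),
    "comparisons": (0.9, 0.1),
    "table": (0.0, 1.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbor(word, embeddings=EMBEDDINGS):
    """Return the other vocabulary word closest to `word` by cosine similarity."""
    target = embeddings[word]
    best_word, best_sim = word, -1.0
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best_word, best_sim = other, sim
    return best_word
```

Because the search is purely geometric, the returned neighbor can have a different part of speech than the original word, which is exactly why this technique can produce ungrammatical statements.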

Back Translation
The back translation technique paraphrases a given sentence in a source language by translating it into another (target) language and then translating it back into the source language. It was first introduced as a data augmentation mechanism for the reading comprehension task (Yu et al., 2018), where significant improvement was observed with back translation augmentation. Recent machine translation systems (Sennrich et al., 2016) are robust to the back translation mechanism and tend to reproduce the original sentence. To overcome this issue, we used two different versions of the same system. First, we translated the English statements into Turkish with Google Translate. Then, we translated the Turkish statements back into English with the GOOGLETRANSLATE function in Google Sheets. Back translation paraphrases the whole sentence and, unlike the WordNet and Word2Vec approaches, it may change word order in addition to individual words, as illustrated in Figure 2.
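The round trip can be sketched as follows. The two lookup tables are hypothetical stand-ins for the two translation systems (Google Translate for English -> Turkish, the Google Sheets GOOGLETRANSLATE function for Turkish -> English); they only make the mechanism visible end to end.

```python
# Stub "translators": in the real pipeline these are two different
# versions of the same MT system, so the round trip does not simply
# reproduce the input sentence.
TO_TURKISH = {"the lowest value": "en dusuk deger"}
TO_ENGLISH = {"en dusuk deger": "the smallest value"}

def back_translate(sentence, to_pivot=TO_TURKISH, from_pivot=TO_ENGLISH):
    """Paraphrase a sentence by translating it to a pivot language and back."""
    pivot = to_pivot.get(sentence.lower(), sentence)
    return from_pivot.get(pivot, pivot)
```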

Experimentation and Results
We conduct two different experimental setups to compare our results. In both, we finetune all layers of a pretrained TAPAS model and its classifier head. First, we finetune the TAPAS model on the SemTabFacts dataset with all combinations of the different augmentation techniques. Second, instead of using augmentation techniques, we finetune the TAPAS model on TabFact only, and then on SemTabFacts and TabFact jointly, and compare the results on the test sets of TabFact and SemTabFacts. In all experiments, we use the AdamW optimizer (Loshchilov and Hutter, 2018) with a 5e-5 learning rate and 0.01 weight decay. We set the batch size to 8 with 2 gradient accumulation steps, and the number of linear warm-up steps is 100 for SemTabFacts training and 2000 for TabFact and joint training. We finetune for 10 epochs and select the best model based on the development set. We use the official evaluation metric, the macro-average of F1 scores over the tables.
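The linear warm-up can be written out explicitly. The sketch below uses our SemTabFacts values (base learning rate 5e-5, 100 warm-up steps); whether the rate stays constant or decays after warm-up is a scheduler design choice, and keeping it constant here is a simplifying assumption.

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=100):
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps.

    After warm-up the rate is held constant in this sketch.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

So at step 50 the learning rate is half of 5e-5, and from step 100 onward it stays at 5e-5; for TabFact and joint training, warmup_steps would be 2000 instead.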
In the augmentation steps, for each statement and each augmentation method, we add one new augmented statement to the training data of SemTabFacts. The original versions of the development and test sets are used without any augmentation. We observe that different augmentation techniques can improve F1 scores on the SemTabFacts test set, as shown in Table 2. The best model on the development set of SemTabFacts uses WordNet and back translation augmentations. Moreover, all augmentation techniques except back translation improve the test F1 score over the base model without augmentation. Finally, we observe that WordNet augmentation increases the test F1 score by 2.5% over the base model without augmentation.
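The expansion of the training data can be summarized in a few lines: each augmentation method contributes one new statement per original training statement, so with k methods the training set grows to (k + 1) times its original size, while the development and test sets stay untouched.

```python
def expand_training_set(statements, augmenters):
    """Return the original statements plus one augmented copy per method.

    `augmenters` is a list of callables (e.g. WordNet swap, Word2Vec swap,
    back translation), each mapping a statement to an augmented statement.
    """
    augmented = list(statements)
    for augment in augmenters:
        augmented.extend(augment(s) for s in statements)
    return augmented
```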
In Table 3, we investigate the domain adaptation and joint learning capabilities of SemTabFacts and TabFact. We finetuned three separate models: on the SemTabFacts training data, on the TabFact training data, and on the SemTabFacts and TabFact training data jointly. We evaluate these finetuned models on the development and test sets of both datasets. The original versions of the training, development, and test sets are used in these experiments without any additional augmentation. The model trained jointly on TabFact and SemTabFacts achieves the highest F1 scores on the test sets of both datasets: it improves the F1 score by 3.31% and 0.77% on the SemTabFacts and TabFact test sets, respectively. Further, we observe that we achieve similar scores on the SemTabFacts test set whether the model is trained on the SemTabFacts training data or on the TabFact training data.
We further analyzed the errors with respect to table size (number of rows × number of columns) and statement length. Our results indicate no significant difference in F1 scores across table sizes or statement lengths: our models perform similarly on small and large tables as well as on short and long statements.

Conclusion
In this work, we described our models for the Table Statement Support Subtask (Phase A) of SemEval-2021 Task 9. Our base model relies on TAPAS, a recently introduced pretrained transformer architecture for tabular data. We proposed three different augmentation techniques based on WordNet, Word2Vec, and back translation, and showed that all combinations of these techniques except back translation perform better on the test set than the model without augmentation. Furthermore, we investigated the domain adaptation and joint learning capabilities of SemTabFacts and TabFact, and showed that our best model in terms of development and test F1 for SemTabFacts is obtained when TAPAS is trained jointly on the SemTabFacts and TabFact datasets. Additionally, we illustrated that the joint model achieves better results on the TabFact test set than the model trained only on the TabFact training data. As future work, we plan to focus on better preprocessing of the SemTabFacts dataset and on more diverse augmentation techniques that integrate perplexity scores of the augmented statements.