TabPert: An Effective Platform for Tabular Perturbation

To truly grasp reasoning ability, a Natural Language Inference model should be evaluated on counterfactual data. TabPert facilitates this by assisting in the generation of such counterfactual data for assessing a model's tabular reasoning. TabPert allows a user to update a table, change its associated hypotheses and their labels, and highlight rows that are important for hypothesis classification. TabPert also captures information about the techniques used to automatically produce the table, as well as the strategies employed to generate the challenging hypotheses. These counterfactual tables and hypotheses, together with the metadata, can then be used to explore an existing model's shortcomings methodically and quantitatively.


Introduction
A crucial part of an NLP model's reasoning capacity is whether, given factual evidence, it can evaluate if a given hypothesis is an entailment (true), a contradiction (false), or neutral (cannot be determined). Current transformer-based models have been shown to outperform humans on these tasks when the evidence is presented as simple unstructured text (Wang et al., 2018, 2019); however, when tested with semi-structured evidence such as tables (Gupta et al., 2020), as shown in Figure 1, the very same models struggle to match human accuracy (Neeraja et al., 2021; Wang et al., 2021; Aly et al., 2021).
Figure 1: A tabular premise example. The hypothesis H1 ("NYSE has fewer than 3,000 stocks listed.") is entailed by it, H2 ("Over 2,500 stocks are listed in the NYSE.") is a contradiction, and H3 ("S&P 500 stock trading volume is over $10 trillion.") is neutral, i.e. neither entailed nor contradictory.

Furthermore, there can be several reasons for a model's correct predictions on a particular example. For example, Poliak et al. (2018) and Gururangan et al. (2018) show that multiple NLI datasets, such as the SNLI and MNLI datasets (Bowman et al., 2015; Williams et al., 2018), exhibit hypothesis bias, i.e., a hypothesis-only model performs significantly better than the majority-label baseline. In the context of tables, Gupta et al. (2020) show that
the right prediction does not always imply reasoning: there can be dataset biases in semi-structured datasets too, such as hypothesis or premise artifacts (spurious patterns) which can wrongly support a particular label. Furthermore, a model can also ignore the given evidence and use its pre-trained knowledge to make predictions. When deployed in the real world on out-of-domain (different category) or counterfactual (perturbed table) examples, these models fail embarrassingly. One way to avoid this inflated performance projection is to test models on several challenge sets before actual deployment. For example, Gupta et al. (2020) evaluate RoBERTa (Large) (Liu et al., 2019) models on two additional adversarial sets (hypothesis-perturbed and out-of-domain) and observe a significant performance drop. However, manually creating such challenge sets can be tricky, both in terms of the annotation cost involved and the actual annotation process, especially with semi-structured tabular data.
Recently, Ribeiro et al. (2020a) showed that one can deploy simple tricks to semi-automate this process and develop several adversarial counterfactual contrast sets by altering existing data, in order to perform behavioral testing of a model. However, such tricks currently only work for unstructured text and cannot be directly adopted for semi-structured text such as tables. To fill this gap, in this work we present TABPERT, an annotation platform especially designed to work on semi-structured tabular data. TABPERT supports semi-automatic creation of tabular counterfactual data. Through TABPERT, annotators can modify tables in several ways, such as (a) deleting information: deleting an attribute-value pair or an entire existing row; (b) inserting information: inserting an attribute-value pair into an existing row or creating a fresh row; (c) modifying information: editing the attribute or value cells of an existing row; and (d) modifying the hypothesis or label: modifying an existing hypothesis and its inference label.
TABPERT also automatically logs the modification operation for each attribute-value pair of the table with respect to the original table. Furthermore, through TABPERT, users can manually log information about the relevant rows and the strategy used for perturbing a table-hypothesis pair, in addition to the gold label. Such metadata is very important in assessing the difficulty of the annotated data, and can later be utilised to systematically study a model's failure modes.
The contributions of our work can be summarised as below:
1. TABPERT can delete, modify, and insert information in semi-structured tabular data for creating counterfactual examples.
2. TABPERT auto-logs table perturbation metadata, and supports manual hypothesis modification and inference label selection.
3. TABPERT assists users in logging metadata, including hypothesis-relevant rows and the perturbation strategy used, which is useful for quantitative analysis of data and models.
4. We present a case study for TABPERT via the generation of a counterfactual INFOTABS dataset and the evaluation of models on it.
The TABPERT source code, the annotated counterfactual INFOTABS dataset, the RoBERTa (Large) model, the annotation instructions and example set, and all other associated scripts are available at https://github.com/utahnlp/tabpert. An instruction video describing TABPERT usage is available at https://www.youtube.com/watch?v=sbCH_zD53Kg.

Tables are Challenging
One might argue that creating a counterfactual dataset for tables is not a challenging task, and that table modification can be fully automated by merely 'shuffling' or 'inserting' the attribute values of one table row into another row with the same attribute, as long as the tables are from similar categories (e.g., shuffle the 'producer' of one film with the 'producer' of another film). One can extend this further by shuffling rows with different attributes, within the same table or across different tables of the same category, as long as the named-entity types of the values are similar (e.g., shuffle the 'producer' and the 'director' of the same or a different film).
However, this approach does not automate the modification of the corresponding hypotheses and their inference labels. Furthermore, such automatic shuffling encourages the violation of certain natural common-sense logical constraints: a person's 'Birth Date' must be before their 'Died Date', a person's 'Marriage Date' should be after their 'Birth Date' and before their 'Died Date', an album's 'Released Date' should be after its 'Recording Date', and so on. Without the enforcement of these constraints, the table will be self-contradictory. While some of these constraints can be satisfied automatically and hence not violated, a majority still slip through because of their sheer variety and variation. Enforcing these constraints automatically during perturbation is a challenging job due to their domain-specific nature; a sketch of such a check is given below. However, such automatic perturbations can be a good initialization for our TABPERT tables, which can then be manually inspected and modified by human annotators for self-consistency, i.e. no natural common-sense violations.
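To make the date-ordering idea concrete, the following is a minimal sketch; it is illustrative only, and the constraint table, date format, and function names are our own assumptions rather than part of TABPERT:

```python
from datetime import datetime

# Illustrative domain-specific ordering constraints: for each category,
# the listed attributes must carry dates in increasing order.
DATE_ORDER = {"person": ["Born", "Married", "Died"],
              "album":  ["Recorded", "Released"]}

def parse_date(value):
    """Parse a date cell; real tables need far more formats than this."""
    return datetime.strptime(value, "%B %d, %Y")

def violates_date_order(table, category):
    """Return True if a shuffled table breaks the ordering constraint."""
    dates = [parse_date(table[k]) for k in DATE_ORDER.get(category, [])
             if k in table]
    return any(a > b for a, b in zip(dates, dates[1:]))

# A shuffle that moved in a 'Died' value from another table:
table = {"Born": "May 1, 1960", "Died": "June 3, 1950"}
assert violates_date_order(table, "person")  # flagged for manual repair
```

Even with such checks, constraints like "a football club's 'Founded' year precedes its 'Stadium opened' year" vary so much by domain that manual inspection remains necessary.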

TABPERT Functions, Aspects, and Usability
TABPERT is currently supported on common web browsers such as Google Chrome and can be installed to run locally. There are three main steps required for successful annotation, as described below.

Automatic Initialization
First, we initialise TABPERT with the original tables and automatic counterfactual tables generated via automatic 'shuffling' of table rows or attribute values. Automatic initialization is beneficial, as manual table creation is both time-consuming and error-prone. This automatic shuffling operation is stored as metadata for each attribute-value pair in the first 4 bits of a 7-bit string, and can be used later to analyse which kind of shuffling was more effective. Table 1 shows the meaning of each of these bits: the 1st bit indicates whether the value comes from the same (0) or a different (1) dataset, the 2nd bit indicates the same (0) or a different (1) category, the 3rd bit indicates whether the value is from the same (0) or a different (1) table (for the same table, the 1st and 2nd bits are always zero), and the 4th indicates the same (0) or a different (1) attribute key.

Table 1: Meaning of the first four bits of the metadata string.

Bit  Location  Same  Different
1    Dataset   0     1
2    Category  0     1
3    Table     0     1
4    Key       0     1

Counterfactual hypothesis (and label) initialization is done by copying the information from the original table exactly.
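As an illustration of how this provenance could be encoded (with the three manual-edit bits described in the next subsection initialised to zero), here is a minimal Python sketch; the function name and API are ours, not TABPERT's:

```python
def provenance_bits(same_dataset, same_category, same_table, same_key):
    """First 4 bits of the 7-bit metadata string, following Table 1:
    0 = value taken from the same dataset/category/table/key,
    1 = from a different one. The last 3 bits (manual edits) start at 0."""
    bits = [int(not same_dataset), int(not same_category),
            int(not same_table), int(not same_key)]
    if same_table:  # same table implies same dataset and same category
        bits[0] = bits[1] = 0
    return "".join(map(str, bits)) + "000"

# A value shuffled in from a different table of the same category,
# but under the same attribute key:
assert provenance_bits(True, True, False, True) == "0010000"
```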

Modifying Tables
The automatically perturbed tables from initialization can now be manually modified to create counterfactual examples. All the cells (attributes and values) in the three counterfactual tables can be edited. TABPERT also supports five additional functions for more challenging edits. The 'Add' box allows annotators to write text and drag-and-drop it into a table to create a new value. To delete a value cell, one simply drags and drops it into the 'Delete' box. One can also edit the text of an existing attribute or value by clicking on it. Lastly, one can delete any row with the 'Edit' option, and insert a new row and its details using the 'Add Section' button. These modification details are also recorded automatically in the last 3 bits of the 7-bit metadata: the 5th bit represents a copy-paste from the original, the 6th represents a new cell or row addition, and the 7th represents a value update operation (see the sketch below). Figure 2a shows the main parts of the TABPERT platform for counterfactual table perturbation.
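A minimal sketch of how these last three bits might be set as annotators edit, continuing the provenance example from the previous subsection (again, names are illustrative rather than TABPERT's actual code):

```python
# 0-indexed positions of the manual-edit flags in the 7-bit string:
# bit 5 = copy-paste from original, bit 6 = new cell/row, bit 7 = value edit.
EDIT_BIT = {"copy_paste": 4, "add": 5, "update": 6}

def record_edit(meta, operation):
    """Set the flag for a manual edit operation on one attribute-value."""
    bits = list(meta)
    bits[EDIT_BIT[operation]] = "1"
    return "".join(bits)

meta = "0010000"                    # provenance from the shuffle step
meta = record_edit(meta, "update")  # annotator then edits the value cell
assert meta == "0010001"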

Hypothesis Modification and Metadata
The text of a hypothesis of a counterfactual table can be edited directly, and its corresponding label can be selected from a drop-down menu. In addition, other metadata information is also collected:
1. The strategies used by the annotator to modify the hypothesis. The five main strategies can be selected using a multi-value check-box; the 'Other' option covers hypothesis changes that do not fall into the main strategies.
2. All the rows of the table that are necessary for deciding the inference label.
Figure 2b shows the main TABPERT view for hypothesis modification, with the hypothesis and inference label. Metadata is entered by the annotator by clicking the '+' symbol on the left side (below the label drop-down) for each hypothesis, as shown in Figure 2b. This opens a metadata collection window, shown in Figure 2c. Here too, we use 6 bits: the initial five for the strategies (in the same order in which the strategies appear in TABPERT, as shown in Figure 2c), and the last one for the 'Other' option. We store the attribute keys of the relevant rows in a list (array) for each hypothesis, along with the final modified hypothesis text, as illustrated below.
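For concreteness, the per-hypothesis metadata described above could be stored as a record like the following sketch; the field names, relevant-row keys, and label encoding are illustrative assumptions, with the hypothesis text taken from Figure 1:

```python
import json

# Illustrative record for one perturbed hypothesis: five strategy bits in
# TABPERT's on-screen order plus one for 'Other', the attribute keys of
# the rows needed to decide the label, and the final text and label.
record = {
    "strategy_bits": "010000",   # e.g. only the second strategy was used
    "relevant_rows": ["Listings"],
    "hypothesis": "NYSE has fewer than 3,000 stocks listed.",
    "label": "E",                # entail / contradict / neutral (illustrative)
}
print(json.dumps(record, indent=2))  # serialised alongside the table JSON
```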
TABPERT Aspects: The TABPERT web-app's core tech stack consists of ReactJS (https://reactjs.org/) and Flask (https://flask.palletsprojects.com/en/2.0.x/). Flask is used as the main back-end Python web framework, and the JavaScript library ReactJS is used for the front-end. We chose Flask because it is easy to extend, giving us the ability to easily integrate Python libraries for quick manipulation of JSON and TSV files. We chose ReactJS for its modular, component-based front-end design.

Case Study: Counterfactual INFOTABS

During annotation, annotators perturb tables and hypotheses, often flipping the original inference label. Furthermore, to make the neutrals harder, we advise annotators to modify them by adding extra 'true' information from the same table to the hypotheses. The above procedure ensures that (a) the final label distribution is balanced, (b) hypothesis bias is destroyed by the flipped labels (Gupta et al., 2020), and (c) 'neutrals' are now closer to 'entails' in terms of lexical overlap (Glockner et al., 2018). Finally, after annotation we have 109 counterfactual tables with a total of 982 hypotheses. We use a RoBERTa (Large) model with the 'para' representation, as well as a hypothesis-only model, to make predictions on the original and the counterfactual annotated data. Table 2 shows the performance results. The data was represented in 'para' form, once with sentences for all rows (All Rows) and once with just the relevant rows' sentences (taken from the annotation metadata).
Performance Analysis: Clearly, the same model struggles with the counterfactual annotated data. Furthermore, the better performance with relevant rows on counterfactual data shows that the model is probably using irrelevant-row tokens as artifacts for making predictions (Neeraja et al., 2021). The hypothesis-only model's performance on counterfactual data is close to the majority-label baseline. Additionally, humans find the original and counterfactual data equally challenging, obtaining a performance of ≈85% on both sets.

Table 2: Performance of each model type on the Original and Counterfactual data.

Perturbation Analysis: We also analyse the hypothesis annotation metadata to check which strategies of hypothesis modification are more effective. From Figure 3, it is evident that a manual table change (TC) for a label flip (LF) is more effective than a manual hypothesis change (HC) for a label flip. Furthermore, strategies involving a label flip are more effective than Hypothesis Prompt (Hypo Prompt) and text overlap. We suspect this is because the flipped labels render hypothesis bias ineffective. Surprisingly, on new hypotheses there is a marginal performance improvement, indicating that simply creating new data is an ineffective approach. Furthermore, there is no significant performance drop with any of the other perturbation approaches.
We also perform a similar analysis on the table perturbation metadata; refer to appendix section A for details. We also show some qualitative examples of counterfactual perturbations for each strategy in appendix section C.

TABPERT Utility
Main Platform: TABPERT is a tool designed specifically for annotating counterfactual tabular reasoning data, and as such it has numerous optimizations, tools, and features that aid in creating and collecting large amounts of data, as well as in annotating it faster and better. It allows for a broader range of tasks than, say, using spreadsheets or MTurk to modify such data. The drag-and-drop functionality simplifies annotation, making a complicated job easy to visualize. In tabular form, all of the data can be examined at once, and background type validation reduces mistakes while dragging and dropping.
Other Task Usage: The initialization source code, as well as the platform, is designed to be modular, so that new components can be readily added or deleted and existing ones updated. For example, the ability to reorganize table parts, copying values across table triplets (in addition to cut-paste), auto-saving work with an undo option, and checkpoints for reversing mistakes can all be readily implemented.
Meta-Data: The collected metadata can be used to generate challenging counterfactual adversarial test sets. For example, if a hypothesis involves rows shuffled in from the train set and the original inference label is inverted, it may be a good candidate for evaluating NLI model overfitting. Furthermore, an example where the hypothesis remains the same but the label is flipped is an excellent test for hypothesis artifacts. The hypothesis-specific rows can be used to reduce the table and to explain the reasoning behind the inference label. The marked labels can be utilized as the gold standard for verifying existing labels. Counterfactual tables may also be used to assess overfitting to pre-trained knowledge. These are only a few of the numerous conceivable application scenarios; one such filter is sketched below.
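As one example, a filter over the collected metadata for the hypothesis-artifact test mentioned above might look like this sketch; the field names are assumptions about the stored JSON, not TABPERT's documented schema:

```python
def hypothesis_artifact_probe(examples):
    """Keep examples whose hypothesis text is unchanged from the original
    but whose gold label flipped: a model leaning on hypothesis artifacts
    should fail on exactly these."""
    return [ex for ex in examples
            if ex["hypothesis"] == ex["orig_hypothesis"]
            and ex["label"] != ex["orig_label"]]

# 'examples' is assumed to be a list of dicts with these (illustrative)
# fields, loaded from the annotated counterfactual JSON files.
```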
We also compare and contrast TABPERT with spreadsheets on effectiveness, visual benefits, and meta-data collection in appendix section B.

TABPERT Limitations and Future
During our pilot study, the platform was run locally by the annotators. This was not problematic, since the number of annotators was small and the tables were divided between them. For large-scale deployment with multiple annotators making simultaneous edits, we would need to host the platform on a central server; this is something we plan to do in the near future. Lastly, the counterfactual data created by making changes had to be saved manually by pressing a button. This was done so that if the user made a mistake, the original data would not be lost, and the user could store the data only after being satisfied with the changes. We wish to add an auto-save feature along with undo options to cater to both of these scenarios.

B TABPERT vs Spreadsheet
Effectiveness: When utilizing spreadsheets for annotation, it becomes quite difficult and time-consuming to cut(copy)-paste numerous row cells several times. In contrast, the efficient drag(click)-drop with automatic checking of restrictions in TABPERT makes the procedure much easier and faster. Editing and text alteration are also easier than in a spreadsheet. Our study found that it takes around 7 minutes on average to annotate a new table with 9 statements using TABPERT, whereas the same work in a spreadsheet would take more than 40 minutes.
Visualisation Benefit: Our platform's table visualisation provides a full data view on a single screen. Seeing the entire picture (table and sentences) is incredibly helpful for quality-checking annotations. It also allows the annotator to quickly follow label and hypothesis changes, which is not feasible with a spreadsheet's cell-type view. Furthermore, a single-screen focus view on one counterfactual table makes altering hypotheses even easier. Using this focus feature, it is straightforward to update the labels or add new information to a hypothesis. This focus view is not viable with a spreadsheet; to make the appropriate alterations, one must search for and navigate to each spreadsheet cell.
Finally, the lack of scrolling required while dragging and dropping on our platform saves annotators time. To find the relevant cells in a spreadsheet, one must scroll up, down, left, or right numerous times.
Furthermore, in TABPERT the cell size is automatically set according to the underlying information, whereas in a spreadsheet this must be handled manually.
Meta-Data Collection: TABPERT also makes it simple to gather information such as the methods used to change a hypothesis and the rows utilized to answer each hypothesis, simply by using a check-box. In a spreadsheet, this would require 9 columns of check-boxes for each table, or manually writing out the metadata; with TABPERT this is done automatically with a single click, making the process simpler, more efficient, and speedier.