Metric-Type Identification for Multi-Level Header Numerical Tables in Scientific Papers

Numerical tables are widely used to present experimental results in scientific papers. For table understanding, metric-types are essential for discriminating the numbers in a table. We introduce a new information extraction task, metric-type identification from multi-level header numerical tables, and provide a dataset extracted from scientific papers consisting of header tables, captions, and metric-types. We then propose two joint-learning neural classification and generation schemes featuring pointer-generator-based and BERT-based models. Our results show that the joint models can handle both in-header and out-of-header metric-type identification.


Introduction
Tables are powerful tools for presenting data efficiently in row and column views. In scientific papers, numerical tables are commonly used to show experimental results for facilitating data analysis. Examples of numerical tables in scientific papers are shown in Figure 1.
Tables can cover multiple categories written in table headers by incorporating several header sets in a hierarchical view; such tables are called multi-level header tables. Scientific papers follow strict conventions for tables; for example, text of a similar type is written in the same header-level. Figure 1a shows a multi-level header example in the column part, with the task type (Task 1 and Task 2) in the first header-level and the metric-type (Prec and Rec) in the second. The table also has a row header specifying the model type (Model A, Model B, Model C, and Model D). In the real world, this header-type information is limited because the table scheme is unknown. However, we assume that tables in scientific papers follow the rule of categorizing header names of a similar type in the same header-level.
To understand the numbers in tables, metric-types are important for discriminating them. Numbers can only be compared when they share the same metric-type across different categories. For the table in Figure 1a, we cannot compare the number 60 for Model A in the first column with the 60 in the second column because they have different metric-types: Prec and Rec. Computing with numbers of different metric-types results in inaccurate analysis.
Different tables may have different ways of writing their header name, such as using abbreviations like p, pre, or prec to refer to precision. Due to the lexical diversity of header names, metric-type identification becomes more challenging. Using a rule-based metric-type tagging or a limited set of metric-types in a dictionary is not enough to cover the diversity. Since tables in scientific papers typically have logical captions and logical categorization of the header-level, we introduce a metric-type identification task that locates the metric-type in the headers by using the caption and header name as inputs. For the example shown in Figure 1a, the metric-type is located in the second level of the column header.
We also cover tables that do not mention metric-types in their header (out-of-headers), as shown in Figure 1b. In these cases, the metric-types are identified in the caption. To cover metric-types located both in the headers and not in the headers, we propose a joint framework of metric-type location prediction and metric-type token generation for the metric-type identification task in multi-level header tables.
Our contributions are as follows:
• We introduce a metric-type identification task for multi-level header tables and propose joint location prediction and generation models to solve the task.
• We provide a dataset consisting of multi-level header numerical tables, captions, and metric-types, extracted from scientific papers. Our datasets will be publicly available.¹
• We introduce a multi-level header table encoder mechanism to obtain table header representations and propose a pointer-generator-based model to cover out-of-header metric-types in the metric-type identification task.
• We fine-tune a general pre-trained encoder (BERT) and a domain-specific encoder (SciBERT) on our task and present the experimental results. We show that the models incorporating the pre-trained encoders achieve significant performance gains, especially when using the domain-specific one.


Related Work

Earlier approaches to metric-type extraction relied on lexical matching: for numerical variables, metric-types were retrieved by searching a set of possible tokens in a dictionary. Focusing on numerical tables, Nourbakhsh et al. (2020) extracted metric-types from earnings reports by using similarity scores between the corresponding non-numeric text of the leftmost cells and stored metric-types. The work closest to ours is that of Hou et al. (2019), who used tables from the experimental result section, combined with the title and abstract as document representations, to extract (task, dataset, metric) triples for leaderboard construction. In our study, we represent tables in a more generic way, preserving the original table structure in its multi-level header form. We intend to retain the ability of a table to cover complex categorization in its headers and to present all values efficiently. A previous study that also explored multi-dimensional tables was done by Milosevic et al. (2016), who automatically detected table structures in XML tables.
Our pointer-generator-based model in the metric-type generation scheme is inspired by the promising results of the pointer-generator network (See et al., 2017) on the summarization task. The network deals with the out-of-vocabulary issue by jointly copying from source texts and generating from the vocabulary.
Recent studies have shown that pre-trained encoders can be successfully fine-tuned for downstream NLP tasks, avoiding the need to train a new model from scratch. The pre-trained encoder BERT (Devlin et al., 2019) was trained on BooksCorpus (800M words) and Wikipedia (2,500M words). For better contextualized representations in the scientific domain, Beltagy et al. (2019) introduced a domain-specific BERT model, SciBERT, trained on 1.14M papers from Semantic Scholar. Friedrich et al. (2020) applied both BERT and SciBERT in their models to solve an information extraction task and achieved significant performance gains.

Metric-Type Identification for Numerical Tables

Datasets
We automatically extracted tables from the PDF files of scientific papers in the computational linguistics domain using PDFMiner and Tabula as extraction tools and filtered only numerical tables related to experimental results using the keywords evaluation, result, comparison, and performance. We used papers from the ACL and EMNLP conferences (2016 to 2019) on the ACL Anthology website as data sources.
In actual scientific papers, knowledge about table semantics is rarely provided. On the basis of how information is "read" from a table, Hurst (2000) separated functional table areas into access cells and data cells, where access cells consist of column headers and/or row headers. We define our data structure on the basis of these functional areas: captions (capt), row headers (rh), column headers (ch), and cells. Headers in the row and column parts have several levels, and we assume that header names in the same level have the same type. Figure 2 shows our table structure.

We hired several qualified workers in the computer science field to manually check the extracted table structure to ensure that the separation of row headers, column headers, and cells was correct, as shown in Figure 3. They then annotated the metric-type of each table, prioritizing locating the metric-type in a specific header-level. The annotators were able to identify the metric-types of approximately 70% of the tables in their headers; for the remaining tables, they determined the metric-type from the table captions. When no metric-type was mentioned in the headers, we assumed the metric-type was the same for all table values. The structures for the example in Figure 3 are capt: "model comparison in task 1 and 2"; rh level 1: [models, models, models, models]; rh level 2: [model a, model b, model c, model d]; ch level 1: [task 1, task 1, task 2, task 2]; ch level 2: [prec, rec, prec, rec]; and metric-type: [prec, rec, prec, rec] (identified in ch level 2).
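To make this structure concrete, the fields above can be modeled as a small record type. This is an illustrative sketch using the example from Figure 3; the class and field names are our own and do not reflect the released dataset format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NumericalTable:
    """Illustrative container for one extracted table (names are assumptions)."""
    capt: str                # table caption
    rh: List[List[str]]      # row header levels: rh[k] holds the names at level k+1
    ch: List[List[str]]      # column header levels: ch[l] holds the names at level l+1
    cells: List[List[float]] # numeric body values, one list per row
    metric_types: List[str]  # one metric-type per row or column

# The example table from Figure 3 (cell values truncated to one row for brevity):
table = NumericalTable(
    capt="model comparison in task 1 and 2",
    rh=[["models"] * 4, ["model a", "model b", "model c", "model d"]],
    ch=[["task 1", "task 1", "task 2", "task 2"], ["prec", "rec", "prec", "rec"]],
    cells=[[60, 60, 55, 58]],
    metric_types=["prec", "rec", "prec", "rec"],  # identified in ch level 2
)
```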
We split our dataset into training, validation, and test sets. The statistics of our dataset are provided in Table 1.

Problem Definition
Let Table = (capt, rh, ch, cells). The task is to identify the metric-type set m̂ in a specific level of the row header (rh_k) or column header (ch_l). To handle tables that do not include metric-types in their headers, we generate m̂ by using information from the table caption. The formulation of metric-type identification is given in Eq. (1), where W_m is the set of metric-types in the vocabulary.

Models
We propose neural models to identify the metric-type for multi-level header tables by means of a joint model of metric-type location prediction and metric-type token generation.

Pointer-Generator with Supervised Attention Model
We obtain the representations of captions and header-levels by using a BiLSTM encoder and then capture header-level weights using supervised attention between the header-level encoder and the metric-type header-location outputs. In the generation scheme, we adopt the pointer-generator network to take into account captions as source texts and the metric-type vocabulary in the metric-type generation gate. The architecture of our model is shown in Figure 4.
Header encoder We obtain the vector representation of each header-level by averaging the vectors of all header name tokens in the same level. Given E_rh_k and E_ch_l as the averaged initial vector representations of the row and column header-levels, respectively, we use a BiLSTM encoder with the dot attention mechanism proposed by Luong et al. (2015) to obtain the representations of the row and column header-levels, and select the last hidden state of the last level, combined with the weighted hidden states, as the header-level contexts.

Caption encoder As with the headers, we use a BiLSTM encoder with attention a_capt_i to compute the context vector of the caption, C_capt.
Metric-type header-location gates We feed the concatenation of the row and column header contexts to a softmax layer to obtain the metric-type header-location probability, which includes the probabilities of the metric-type being located in the row headers (p_rh), in the column headers (p_ch), or not in the headers (p_capt), where p_rh + p_ch + p_capt = 1.
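The three-way location gate can be sketched in plain Python as a softmax over the two header locations and the caption. The function and the logit values are illustrative stand-ins for the model's linear layer over the concatenated header contexts, not the actual implementation.

```python
import math

def location_probs(logits):
    """Softmax over the three metric-type locations: row header, column header,
    caption. `logits` stands in for the output of a linear layer over the
    concatenated row and column header contexts."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(("p_rh", "p_ch", "p_capt"), exps)}

probs = location_probs([2.0, 0.5, -1.0])  # hypothetical logits
```

By construction the three probabilities sum to one, matching the constraint p_rh + p_ch + p_capt = 1.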

Metric-type header-level gates
Since the attention scores a_rh_k and a_ch_l capture the relevant header-level information in the row and column parts, we use these attention scores as header-level weights, where i ∈ {1, ..., u, u+1, ..., u+v} is the header-level index.
Metric-type generation gates In our pointer-generator network, we use a sigmoid layer to obtain a switch copy probability p_copy ∈ [0, 1], which lets us choose between copying word w_capt from the table caption and generating word w_m from the metric-type vocabulary.
We use a softmax function to compute the probability distribution over the metric-type vocabulary, and then obtain a probability distribution over the extended vocabulary, where i is the index of metric-type tokens in the vocabulary.
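The extended-vocabulary mixture can be illustrated with a dictionary-based stand-in for the tensor computation: the final probability of a word combines the copy distribution (attention mass over caption tokens) and the generation distribution over the metric-type vocabulary. The function name and example tokens are hypothetical.

```python
def extended_vocab_dist(p_copy, caption_attn, vocab_dist):
    """Pointer-generator mixture over the extended vocabulary:
        P(w) = p_copy * attn(w over caption tokens)
             + (1 - p_copy) * P_vocab(w).
    caption_attn: {caption token: attention weight}, summing to 1.
    vocab_dist:   {vocabulary token: probability}, summing to 1."""
    words = set(caption_attn) | set(vocab_dist)
    return {
        w: p_copy * caption_attn.get(w, 0.0) + (1 - p_copy) * vocab_dist.get(w, 0.0)
        for w in words
    }
```

A caption-only token (out of the metric-type vocabulary) can still receive probability mass through the copy term, which is how the model covers metric-types that appear only in the caption.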
Learning objective For training, we use the negative log-likelihood as the loss function. In addition, we adopt supervised attention (Liu et al., 2016) to jointly supervise the row and column header-level attention for obtaining the metric-type header-level. We combine all loss functions of the location classification and token generation model, with weight α:

loss = −( Σ_c z_hloc,c log p_c + Σ_i z_hlvl,i log w_hlvl,i + α (log p_copy + log P_vocab(w_m)) ),   (9)

where c ∈ {capt, rh, ch} ranges over the metric-type header-location classes and z_hloc is the binary indicator (0 or 1) of each corresponding class.

Fine-Tuned BERT-Based Model

In BERT, for two types of input text, e.g., pairs of question and answer, a [CLS] token is prepended before the question tokens, and [SEP] tokens are placed after the question and after the answer tokens to separate the question and answer segments. Following Liu and Lapata (2019), we customize this preprocessing scheme by inserting [CLS] before each segment and inserting [SEP]
after each segment. We divide our inputs into several segments: caption, row header level 1 to u, and column header level 1 to v.
The input text after preprocessing is denoted as a sequence of tokens X = (x_1, x_2, ..., x_n). There are three kinds of embedding assigned to each x_i: token embeddings representing the meaning of each token, segmentation embeddings indicating the segment boundaries of a sequence of tokens, and position embeddings covering token position within the sequences. Since BERT covers only two segments in its input, we treat the odd segments as segment A and the even ones as segment B. The sum of these three embeddings is fed to a bidirectional Transformer layer of BERT.
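This segment preprocessing can be sketched as follows; the function is an illustrative assumption, not the paper's code.

```python
def build_bert_input(segments):
    """Insert [CLS] before and [SEP] after each segment (caption, then each
    row/column header level) and assign alternating segment ids, since BERT's
    segment embedding only distinguishes two segments: odd segments (1st,
    3rd, ...) become segment A (0), even segments become segment B (1)."""
    tokens, segment_ids = [], []
    for idx, seg in enumerate(segments):
        piece = ["[CLS]"] + list(seg) + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([idx % 2] * len(piece))
    return tokens, segment_ids

tokens, seg_ids = build_bert_input([["model", "comparison"], ["task", "1"]])
```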
We use the token representations from the top hidden layers of the pre-trained Transformer as context embeddings. We assume the context vectors of each [CLS] token can represent the segment sequences better. As shown in Figure 5, we denote the input embedding as E, the final hidden vector of the [CLS] token for the i-th input segment as C_i ∈ R^H, and the final hidden vector for the j-th input token as T_j ∈ R^H. We use a metric-type header-location gate and a metric-type header-level gate for metric-type location classification, and a metric-type generation gate to generate metric-type tokens from the vocabulary, covering out-of-header metric-types. Our BERT-based model architecture is shown in Figure 5.
Metric-type header-location gates We feed the first segment context C_1 to the softmax layer to obtain the metric-type header-location probability.

Metric-type header-level gates In our task, segments represent the table sections most related to the metric-type. We feed each segment context C_i to a sigmoid layer to obtain the probability of the metric-type being located in a specific header-level. The probabilities are then normalized over all segments as weight scores of the header-levels.

Metric-type generation gates We use a softmax function over the first segment context C_1 to compute a probability distribution over the metric-type vocabulary.

Learning objective We combine all loss functions of the metric-type header-location, metric-type header-level, and metric-type generation gates, where α is the weight of the metric-type generation function.
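The header-level gate of the BERT-based model (a sigmoid per segment context, then normalization across segments) can be sketched in plain Python; the logits stand in for the linear projections of the segment contexts C_i and are hypothetical.

```python
import math

def header_level_weights(segment_logits):
    """Apply a sigmoid to each segment's score independently, then normalize
    across segments so the header-level weights form a distribution (a sketch
    of the BERT-based metric-type header-level gate)."""
    sig = [1.0 / (1.0 + math.exp(-x)) for x in segment_logits]
    total = sum(sig)
    return [s / total for s in sig]

weights = header_level_weights([1.0, 0.0, -1.0])  # one logit per segment
```

Unlike a softmax, the per-segment sigmoid scores are computed independently before normalization, so each segment's gate can be supervised on its own.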

Baseline Model
We use two SVM classification models as baselines: a metric-type location prediction model and a metric-type token prediction model over the vocabulary of metric-types. We use the tf-idf of the concatenated header name tokens of all levels as the input representation for the first model and the tf-idf of the caption tokens for the second. We tuned the hyperparameters of the SVM models and report the best results.

Evaluation Metrics
We use accuracy metrics to evaluate the predicted metric-type location and the generated metric-type tokens.
Metric-type location accuracy The target of the metric-type location prediction model is whether the metric-type is located in the row headers, in the column headers, or not found in the headers. The header-location accuracy (acc_hloc) is the rate of correct header-location predictions.
Since details about the metric-type location at the header-level are needed to identify metric-type token lists, we also compute the metric-type header-level accuracy (acc_hlevel) as the ratio of correct header-level predictions to the total number of predictions.
Metric-type token accuracy Let m̂ = (ŵ_m1, ..., ŵ_mn) denote the sequence of predicted metric-type tokens for n_r rows or n_c columns (depending on the header-location prediction), and let m = (w_m1, ..., w_mn) denote the target sequence: for example, m̂ = (f1, f1, f1) and m = (f-1, f-1, f-1). We calculate the metric-type token accuracy using string matching of the full token lists m̂ and m, and string matching of each token pair ŵ_mi and w_mi in the token lists. To cover token predictions that are abbreviations, we also compute a metric-type token accuracy based on ordered character matching, where d is the number of ŵ_m whose characters are all found in w_m in the same order. For example, the predicted token RG1 is regarded as correct when the reference token is ROUGE-1.
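The ordered character matching amounts to a subsequence check: a predicted token counts as correct if its characters all occur in the reference token in the same order. A minimal sketch (function names are our own):

```python
def matches_abbrev(pred, ref):
    """True if every character of `pred` appears in `ref` in the same order,
    i.e., `pred` is a subsequence of `ref`. Covers abbreviated predictions
    such as RG1 for ROUGE-1."""
    it = iter(ref)
    # `ch in it` advances the iterator, so order is enforced.
    return all(ch in it for ch in pred)

def char_match_accuracy(preds, refs):
    """Share of predicted tokens whose characters all occur, in order, in the
    corresponding reference token (the count d over the list length)."""
    d = sum(matches_abbrev(p.lower(), r.lower()) for p, r in zip(preds, refs))
    return d / len(refs)
```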

Implementation Details
We implemented our models using the AllenNLP library (Gardner et al., 2018). In our pointer-generator-based model, we used pre-trained word embeddings for initialization and two-layer BiLSTMs with a hidden size of 256 in both the caption and header-level encoders. We used dropout (Srivastava et al., 2014) with probability p = 0.1. For optimization in the training phase, we used Adam with a batch size of 10 and learning rates of 3 × 10^-3 and 3 × 10^-5 for the pointer-generator-based and BERT-based models, respectively, with a slanted triangular schedule (Howard and Ruder, 2018). We trained the models for a maximum of 20 epochs with early stopping on the validation set (patience of 10) and set α to 0.5. We used the original BERT and the domain-specific SciBERT uncased models to fine-tune our BERT-based model.

Experimental Results
Model comparison The performances of the proposed and baseline models are shown in Table 2. The Pointer-Generator Supervised-Attention model initialized with GloVe embeddings outperformed the baseline in predicting the metric-type location, and its accuracy in the metric-type generation part mostly scored better than the baseline as well. However, the performance dropped significantly when the input was initialized with BERT or SciBERT embeddings: these embeddings failed to guide our pointer-generator-based model in the metric-type identification task, especially in generating metric-type tokens.
The accuracy of our BERT-based model was significantly better than the others, achieving header-location and header-level prediction accuracies of more than 90% and a generation accuracy improvement of more than 7 percentage points. Fine-tuning with the domain-specific SciBERT led to significant performance gains on all metrics.
The effect of copy mechanism We evaluated our pointer-generator-based model using an ablation test, as shown in Table 3. The performance of our generation model without the copy mechanism decreased, demonstrating that incorporating the copy mechanism is beneficial for metric-type token generation. Our model had the worst accuracy when run without the pointer-generator network, since the location prediction model alone failed to handle out-of-header metric-types.

The effect of segment embeddings Table 4 shows the effect of segment embeddings in our BERT-based models. The accuracies of the fine-tuned BERT and SciBERT models without segment embeddings both decreased. This means that segment embeddings successfully discriminate header-level boundaries in the input representation of the BERT-based models.

Qualitative Analysis
We analyzed the errors of our pointer-generator-based and fine-tuned SciBERT models by means of the confusion matrices shown in Tables 5 and 6. For clarity, we define the outputs in the matrices as "LRow" for a metric-type located in the row headers, "LCol" for a metric-type located in the column headers, "CCapt" for a metric-type copied from the caption, and "Gen" for a metric-type generated from the vocabulary. The matrix for the fine-tuned SciBERT model does not include the CCapt class since this model does not contain a copy mechanism.
As shown in Table 5, the most correct classifications were for copying from the headers (row and column), while the highest confusions were between copying from the caption and generating from the vocabulary. The accuracy of generating correct metric-type tokens from the vocabulary was 27.78%, while the accuracy of copying a metric-type from the caption was 75%; the copy mechanism thus contributes more to performance than the generation one.
From the confusion matrix of the SciBERT-based model in Table 6, we can see that the highest confusion was for copying from the headers. We also computed the accuracy of the generated metric-type tokens and found that just 58.7% of them were correct.
We also investigated the errors in the predicted metric-type tokens. We found that the models tended to generate more generic metric-types; for example, they extracted score as a prediction for the target accuracy. On the other hand, our models also generated terms similar to the ground-truth metric-type, such as generating the metric-type pearson's for the target r. Examples are shown in Table 7.

Conclusion
In this work, we provided multi-level header numerical table datasets extracted from scientific papers, consisting of header tables, captions, and metric-types. We introduced a metric-type identification task for multi-level header numerical tables and proposed joint location prediction and generation models to solve the task. We have shown that our proposed models can identify metric-types from multi-level header tables, both when the metric-types are included in the headers and when they are not.
Our datasets only cover scientific papers in the computational linguistics domain. The generalization of our results beyond this domain remains an open question due to the difficulty of collecting comparable datasets in other domains without additional annotation by human experts.