Text-to-Table: A New Way of Information Extraction

We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods using relation extraction and named entity extraction. The results also show that our method can further boost the performance of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data are available at https://github.com/shirley-wu/text_to_table.


Introduction
Information extraction (IE) is a task that aims to extract information of interest from text data and represent the extracted information in a structured form. Traditional IE tasks include named entity recognition which recognizes entities and their types (Huang, Xu, and Yu 2015;Ma and Hovy 2016;Lample et al. 2016;Devlin et al. 2019), and relation extraction which identifies the relationships between entities (Zheng et al. 2017;Zeng et al. 2018;Luan et al. 2019;Zhong and Chen 2020). Since the results of IE are structured, they can be easily used by computer systems in different applications such as text mining.
In this work, we study IE in a new setting, referred to as text-to-table. First, the system receives a training dataset containing text-table pairs. Each text-table pair contains a text and a table (or tables) representing information extracted from the text. The system learns a model for information extraction. Next, the system employs the learned model to conduct information extraction from a new text and outputs the result in a table (or tables). Figure 1 gives an example of text-to-table, where the input (above) is a report of a basketball game, and the output (below) is two tables summarizing the scores of the teams and players from the input.

[Figure 1: An example of text-to-table from the Rotowire dataset. The text is a report of a basketball game, and the tables are the scores of the teams and players.]
Our work is inspired by research on the so-called table-to-text (or data-to-text) problem, which is the task of generating a description for a given table. Table-to-text is useful in applications where the content of a table needs to be described in natural language. Text-to-table can be regarded as an inverse problem of table-to-text. However, there are also differences. Most notably, their applications are different. Text-to-table can be applied to document summarization, text mining, etc.
Text-to-table is unique compared to the traditional IE approaches. First, it is mainly designed to extract information on complex relations between items from a long text (e.g., the whole document). Second, the schemas for extraction are implicitly included in the training data, and there is no need to explicitly define the schemas. As a result, one can easily build a system to perform the task.
In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) task. More specifically, we translate the text into a sequence representation of a table (or tables), where the schema of the table is implicitly contained in the representation. We also assume that the seq2seq model is built on top of a pre-trained language model, such as BART (Lewis et al. 2019) and T5 (Raffel et al. 2020). Although the approach is a natural application of existing technologies, as far as we know, there has been no previous study to investigate to what extent the approach works. We also develop a new method for text-to-table within the seq2seq approach with two additional techniques, table constraint and table relation embeddings. Table constraint controls the creation of rows in a table, and table relation embeddings affect the alignments between cells and their row headers and column headers. Both are designed to make the generated table well-formulated.
The approach to IE based on seq2seq has already been proposed. Methods for conducting individual tasks of relation extraction (Zeng et al. 2018;Nayak and Ng 2020), named entity recognition (Chen and Moschitti 2018;Yan et al. 2021), and event extraction (Lu et al. 2021) have been developed. Methods for jointly performing multiple tasks of named entity recognition, relation extraction, and event extraction have also been devised (Paolini et al. 2021). Most of the methods exploit suitable pre-trained models such as BERT. However, all the existing methods rely on pre-defined schemas for extraction. Moreover, their models are designed to extract information from short texts, rather than long texts, and extract information with simple structures (such as an entity and its type), rather than information with complicated structures (such as a table).
We conduct extensive experiments on four existing table-to-text datasets. Experimental results show that the vanilla seq2seq model fine-tuned from BART (Lewis et al. 2019) can outperform the state-of-the-art IE models fine-tuned from BERT (Devlin et al. 2019;Zhong and Chen 2020). Furthermore, results show that our proposed approach to text-to-table with the two techniques can further improve the extraction accuracy. We also summarize the challenging issues with the seq2seq approach to text-to-table for future research.
We make the following contributions in this work:
1. We propose the new task of text-to-table for IE. We derive four new datasets for the task from existing datasets.
2. We formalize the task as a seq2seq problem and propose a new method within the seq2seq approach using the techniques of table constraint and table relation embeddings.
3. We conduct extensive experiments to verify the effectiveness of the proposed approach.

Related Work
Information Extraction (IE) is a task of extracting information (structured data) from a text (unstructured data). For example, named entity recognition (NER) recognizes entities appearing in a text. Relation extraction (RE) identifies the relationships between entities. Another example is event extraction (EE), which discovers events occurring in a text. Traditionally, researchers formalize the task as a language understanding problem. The state-of-the-art methods for NER perform the task on the basis of the pre-trained language model BERT (Devlin et al. 2019). The pipeline approach to RE divides the problem into NER and relation classification, and conducts the two sub-tasks in a sequential manner (Zhong and Chen 2020), while the end-to-end approach jointly carries out the two sub-tasks (Zheng et al. 2017;Zeng et al. 2018;Luan et al. 2019). The state-of-the-art methods for EE also employ BERT and usually jointly train the models with other tasks such as NER and RE (Zhang, Ji, and Sil 2019;Lin et al. 2020). All the methods assume the use of pre-defined schemas (e.g., entity types for NER, entity and relation types for RE, and event templates for EE). Besides, most methods are designed for extraction from short texts. Therefore, existing methods for IE cannot be directly applied to text-to-table.
IE is also conducted at document level, referred to as doc-level IE. For example, some NER methods directly perform NER on a long document (Strubell et al. 2017;Luo et al. 2018), and others encode each sentence in a document, use attention to fuse document-level information, and perform NER on each sentence (Hu et al. 2020;Xu, Wang, and He 2018). There are also RE methods that predict the relationships between entities in a document (Yao et al. 2019;Nan et al. 2020a). However, existing doc-level IE approaches usually do not consider extraction of complex relations between many items.
Sequence-to-sequence (seq2seq) is the general problem of transforming one text into another text (Sutskever, Vinyals, and Le 2014;Bahdanau, Cho, and Bengio 2014), which includes machine translation, text summarization, etc., as special cases. The use of the pre-trained language models BART (Lewis et al. 2019) and T5 (Raffel et al. 2020) can significantly boost performance on seq2seq tasks such as machine translation (Lewis et al. 2019;Raffel et al. 2020;Liu et al. 2020) and text summarization (Lewis et al. 2019;Raffel et al. 2020;Huang et al. 2020).
Recently, some researchers also formalize the IE problems as seq2seq, that is, transforming the input text into an internal representation. One advantage is that one can employ a single model to extract multiple types of information. Experimental results show that this approach works better than or equally well as the traditional approach of language understanding, in RE (Zeng et al. 2018;Nayak and Ng 2020), NER (Chen and Moschitti 2018;Yan et al. 2021) and EE (Lu et al. 2021).
Data-to-text aims to generate natural language descriptions from the input structured data such as sport commentaries (Wiseman, Shieber, and Rush 2017). The structured data is usually represented as tables (Wiseman, Shieber, and Rush 2017;Thomson, Reiter, and Sripada 2020;Chen et al. 2020), sets of table cells (Parikh et al. 2020;Bao et al. 2018), semantic representations (Novikova, Dušek, and Rieser 2017), or sets of relation triples (Gardent et al. 2017;Nan et al. 2020b). The task requires the model to select the salient information from the data, organize it in a logical order, and generate an accurate and fluent natural language description (Wiseman, Shieber, and Rush 2017). Data-to-text models usually adopt the encoder-decoder architecture, where the encoders are specifically designed to model the input data, such as multi-layer perceptron (Puduppully, Dong, and Lapata 2019a,b) and recurrent neural network (Juraska et al. 2018;Liu et al. 2018;Shen et al. 2020).

Problem Formulation
Text-to-table takes a text as input and produces one or several tables to summarize the content of the text, as shown in Figure 1. This can be considered as an inverse problem of data-to-text, and each has its applications.
The input is a text denoted as $x = x_1, x_2, \cdots, x_{|x|}$. The output is one table or multiple tables. For simplicity, suppose that there is only one table denoted as $T$. Further suppose that $T$ has $n_r$ rows and $n_c$ columns. There are $n_r \times n_c$ cells in $T$, where the cell of row $i$ and column $j$ is a sequence of tokens denoted as $t_{i,j}$. There are three types of tables: one that has both column headers and row headers, one that has only column headers, and one that has only row headers. For example, the player table in Figure 1 has both column headers ("Assists", "Points", etc.) and row headers ("Al Horford", "Isaiah Thomas", etc.). We let $t_{1,j}, j = 2, 3, \cdots, n_c$ denote the column headers, $t_{i,1}, i = 2, 3, \cdots, n_r$ denote the row headers, and $t_{i,j}, i = 2, 3, \cdots, n_r, j = 2, 3, \cdots, n_c$ denote the non-header cells of the table. For example, in the player table in Figure 1, $t_{1,2}$ = Assists, $t_{2,1}$ = Al Horford, and $t_{2,2}$ = 5.
We consider using machine learning to perform text-to-table. In learning, a number of text-table pairs are given as training data, and a model is trained from the data. In inference, the learned model is utilized to generate a table or tables given a new text.
Once the information of a text is extracted into tables via text-to-table, it can be leveraged in many different applications such as document summarization and text mining. For example, in Figure 1, one can quickly obtain the key information of the text by simply looking at the tables summarized from the text.
There are differences between text-to-table and traditional settings of information extraction. As can be seen from the example in Figure 1, extraction of information is performed from the entire document. The extracted information (structured data) is in a complex form, specifically multiple types of scores of teams and players in a basketball game. Furthermore, the data-driven approach is taken, and the schemas of the tables do not need to be explicitly defined.

Our Method
We develop a method for text-to-table using the seq2seq approach and the two techniques of table constraint and table relation embeddings.

Seq2Seq Framework
We formalize text-to-table as a sequence-to-sequence (seq2seq) problem (Sutskever, Vinyals, and Le 2014;Bahdanau, Cho, and Bengio 2014). Specifically, given an input text, we generate a sequence representing the output table (or tables). We introduce two special tokens, a separation token denoted as $\langle s \rangle$ and a new-line token denoted as $\langle n \rangle$. For a table $T$, we represent each row $t_i$ with a sequence of cells delimited by separation tokens:

$$\langle s \rangle, t_{i,1}, \langle s \rangle, \cdots, \langle s \rangle, t_{i,n_c}, \langle s \rangle$$

We represent the entire table with a sequence of rows delimited by new-line tokens:

$$\langle s \rangle, t_{1,1}, \langle s \rangle, \cdots, \langle s \rangle, t_{1,n_c}, \langle s \rangle, \langle n \rangle,$$
$$\langle s \rangle, t_{2,1}, \langle s \rangle, \cdots, \langle s \rangle, t_{2,n_c}, \langle s \rangle, \langle n \rangle,$$
$$\cdots\cdots$$
$$\langle s \rangle, t_{n_r,1}, \langle s \rangle, \cdots, \langle s \rangle, t_{n_r,n_c}, \langle s \rangle$$

Figure 2 shows the sequence of the player table in Figure 1. When there are multiple tables, we create a sequence of tables using the captions of the tables as delimiters.
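To make the serialization concrete, below is a minimal Python sketch of it. It is illustrative rather than the authors' code; the strings "<sep>" and "<nl>" are hypothetical surface forms for the special tokens $\langle s \rangle$ and $\langle n \rangle$, and the cell values are made up for the example.

```python
# A minimal sketch of the table serialization described above (not the
# authors' code). "<sep>" and "<nl>" stand in for the separation token
# <s> and the new-line token <n>.
SEP, NL = "<sep>", "<nl>"

def serialize_table(table):
    """table: a list of rows, each a list of cell strings (headers included)."""
    # Each row i becomes: <s> t_{i,1} <s> t_{i,2} <s> ... <s> t_{i,n_c} <s>
    rows = [f"{SEP} " + f" {SEP} ".join(row) + f" {SEP}" for row in table]
    # Rows are delimited by the new-line token
    return f" {NL} ".join(rows)

# A fragment of the player table of Figure 1 (cell values illustrative);
# t_{1,1} is an empty cell because the table has both kinds of headers.
players = [
    ["", "Assists", "Points"],
    ["Al Horford", "5", "15"],
    ["Isaiah Thomas", "5", "23"],
]
print(serialize_table(players))
# <sep>  <sep> Assists <sep> Points <sep> <nl> <sep> Al Horford <sep> 5 ...
```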
Let $x = x_1, \cdots, x_{|x|}$ and $y = y_1, \cdots, y_{|y|}$ denote the input and output sequences respectively. In inference, the model generates the output sequence based on the input sequence. We conduct decoding in an auto-regressive way, which generates one token at each step based on the tokens it has generated so far. Formally, the model calculates the conditional probability

$$P(y \mid x) = \prod_{i=1}^{|y|} P(y_i \mid y_1, \cdots, y_{i-1}, x).$$

We use decoding algorithms such as beam search or greedy search to find an output sequence that approximately maximizes the conditional probability.
In training, we learn the model based on the text-table pairs $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$. The objective of learning is to minimize the cross-entropy loss

$$\arg\min_{\theta} \sum_{i=1}^{n} -\log P(y_i \mid x_i; \theta),$$

where $\theta$ denotes the parameters of the model. In our work, we adopt Transformer as the model (Vaswani et al. 2017), which is the state-of-the-art method for seq2seq. We build the model on top of the pre-trained language model BART (Lewis et al. 2019) by fine-tuning.
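As a concrete illustration, the following sketch fine-tunes BART on a single text-table pair using the Hugging Face transformers library. It is a minimal sketch, not the authors' training code: the learning rate, the example text, and the "<sep>"/"<nl>" token strings are all assumptions, and batching and padding details are omitted.

```python
# A minimal fine-tuning sketch with the transformers library (illustrative;
# the authors' actual code and hyper-parameters may differ).
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Register the two special tokens and resize the embedding matrix accordingly
tokenizer.add_tokens(["<sep>", "<nl>"])
model.resize_token_embeddings(len(tokenizer))

optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # illustrative lr

text = "Al Horford scored 15 points and handed out 5 assists ..."
target = "<sep>  <sep> Assists <sep> Points <sep> <nl> <sep> Al Horford <sep> 5 <sep> 15 <sep>"

batch = tokenizer(text, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

loss = model(**batch, labels=labels).loss  # token-level cross-entropy
loss.backward()
optimizer.step()
```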
We refer to the method described above as "vanilla seq2seq". There is no guarantee, however, that the output sequence of vanilla seq2seq represents a well-formulated table. We add a post-processing method to ensure that the output sequence is a table. The post-processing method takes the first generated row as well-defined, and for every other row deletes extra cells at the end or inserts empty cells at the end, so that all rows have the same number of cells as the first row.
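A minimal sketch of this post-processing, assuming the output sequence has already been split into rows of cell strings using the new-line and separation tokens:

```python
# Post-processing sketch: the first row is taken as well-defined and fixes
# the table width; longer rows are truncated and shorter rows are padded.
def normalize_rows(rows):
    """rows: list of lists of cell strings parsed from the output sequence."""
    if not rows:
        return rows
    width = len(rows[0])
    fixed = [rows[0]]
    for row in rows[1:]:
        if len(row) > width:
            row = row[:width]                      # delete extra cells at the end
        else:
            row = row + [""] * (width - len(row))  # insert empty cells at the end
        fixed.append(row)
    return fixed
```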

Techniques
We develop two techniques to improve table generation, called table constraint and table relation embeddings. We use "our method" to denote the seq2seq approach with these two techniques.

Table Constraint
Our method exploits a constraint in the decoding process to ensure that the output sequence represents a well-formulated table. Specifically, our method calculates the number of cells in the first row it generates, and then forces the following rows to contain the same number of cells. The algorithm of the decoder is shown in Algorithm 1.
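The following is a minimal greedy-decoding sketch of the idea, a simplification of Algorithm 1 rather than the authors' implementation. Here step_fn stands in for the seq2seq model: given the prefix generated so far, it returns candidate next tokens ranked by probability; the token strings are placeholders.

```python
# A greedy-decoding sketch of the table constraint (simplified Algorithm 1).
def constrained_decode(step_fn, eos="<eos>", sep="<sep>", nl="<nl>", max_len=512):
    y = []
    seps_per_row = None   # fixed once the first row is complete
    seps_in_row = 0       # separation tokens seen in the current row

    def legal(tok):
        if seps_per_row is None:            # first row: generated freely
            return True
        if tok in (nl, eos):                # a row may only end at the right width
            return seps_in_row == seps_per_row
        if tok == sep:                      # a row may not exceed that width
            return seps_in_row < seps_per_row
        return True

    while len(y) < max_len and (not y or y[-1] != eos):
        # take the highest-ranked legal token (assumes one always exists)
        tok = next(c for c in step_fn(y) if legal(c))
        y.append(tok)
        if tok == sep:
            seps_in_row += 1
        elif tok == nl:
            if seps_per_row is None:
                seps_per_row = seps_in_row  # the first row fixes the width
            seps_in_row = 0
    return y
```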

Table Relation Embeddings
Our method also incorporates table relation embeddings, including row relation embeddings and column relation embeddings, into the self-attention of the Transformer decoder. Given a token in a non-header cell, the row relation embeddings $\tau_r^K$ and $\tau_r^V$ indicate which row header the token is aligned to, and the column relation embeddings $\tau_c^K$ and $\tau_c^V$ indicate which column header the token is aligned to.
Let us consider the self-attention function in one block of the Transformer decoder: at each position, self-attention only attends to the previous positions. For simplicity, let us only consider one head in the self-attention. At the $t$-th position, the input of self-attention is the sequence of representations $z = (z_1, \cdots, z_t)$ and the output is the sequence of representations $h = (h_1, \cdots, h_t)$, where $z_i \in \mathbb{R}^d$ and $h_i \in \mathbb{R}^d$ are the representations at the $i$-th position ($i = 1, \cdots, t$).
In a conventional Transformer decoder, self-attention is defined as follows, for $i = 1, \cdots, t$ and $j = 1, \cdots, i$:

$$e_{ij} = \frac{(z_i W^Q)(z_j W^K)^\top}{\sqrt{d_k}}, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{i} \exp(e_{ij'})}, \qquad h_i = \Big( \sum_{j=1}^{i} \alpha_{ij} \, z_j W^V \Big) W^O,$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$ are the query, key, and value weight matrices respectively, and $W^O \in \mathbb{R}^{d_k \times d}$ is the output weight matrix. Our method injects the relation vectors $r_{ij}^K$ and $r_{ij}^V$ into the keys and values, in the manner of relative position embeddings:

$$e_{ij} = \frac{(z_i W^Q)(z_j W^K + r_{ij}^K)^\top}{\sqrt{d_k}}, \qquad h_i = \Big( \sum_{j=1}^{i} \alpha_{ij} \, (z_j W^V + r_{ij}^V) \Big) W^O.$$
[Algorithm 1: Decoding using table constraint. $\langle eos \rangle$, $\langle s \rangle$, and $\langle n \rangle$ denote the end-of-sequence token, separation token, and new-line token respectively. Seq2seq denotes the seq2seq model. Decode denotes the decoding algorithm, such as beam search or greedy search.]
The relation vectors $r_{ij}^K$ and $r_{ij}^V$ are defined as follows. For the token at the $i$-th position, if the token at the $j$-th position is a part of its row header, then $r_{ij}^K$ and $r_{ij}^V$ are set to the row relation embeddings $\tau_r^K$ and $\tau_r^V$. Similarly, for the token at the $i$-th position, if the token at the $j$-th position is a part of its column header, then $r_{ij}^K$ and $r_{ij}^V$ are set to the column relation embeddings $\tau_c^K$ and $\tau_c^V$. Otherwise, $r_{ij}^K$ and $r_{ij}^V$ are set to 0. To identify the row header or the column header of a token, we parse the sequence generated so far into a partial table using the new-line tokens and separation tokens in the sequence. Figure 3 illustrates how relation vectors are constructed.
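To illustrate, here is a minimal single-head PyTorch sketch of the relation-aware self-attention defined above. It assumes rel_k and rel_v are precomputed (t, t, d_k) tensors holding the vectors $r_{ij}^K$ and $r_{ij}^V$ produced by the parsing step just described; multi-head attention and batching are omitted.

```python
# Single-head causal self-attention with table relation embeddings (sketch).
import math
import torch

def relation_attention(z, Wq, Wk, Wv, Wo, rel_k, rel_v):
    """z: (t, d); Wq/Wk/Wv: (d, d_k); Wo: (d_k, d);
    rel_k, rel_v: (t, t, d_k) tensors of r^K_ij / r^V_ij (or zeros)."""
    t, d_k = z.size(0), Wq.size(1)
    q, k, v = z @ Wq, z @ Wk, z @ Wv                      # each (t, d_k)
    # e_ij = q_i . (k_j + r^K_ij) / sqrt(d_k)
    e = (q.unsqueeze(1) * (k.unsqueeze(0) + rel_k)).sum(-1) / math.sqrt(d_k)
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool))
    e = e.masked_fill(~causal, float("-inf"))             # attend to j <= i only
    a = torch.softmax(e, dim=-1)                          # alpha_ij
    # h_i = (sum_j alpha_ij (v_j + r^V_ij)) W^O
    return (a.unsqueeze(-1) * (v.unsqueeze(0) + rel_v)).sum(dim=1) @ Wo
```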

Datasets
We make use of four existing datasets which are traditionally utilized for data-to-text: Rotowire (Wiseman, Shieber, and Rush 2017), E2E (Novikova, Dušek, and Rieser 2017), WikiTableText (Bao et al. 2018), and WikiBio (Lebret, Grangier, and Auli 2016). In each dataset, we filter out the content in the tables that does not appear in the texts. We plan to make the processed datasets publicly available for future research. Table 2 gives the statistics of the Rotowire dataset, and Table 1 gives the statistics of the other three datasets.
Rotowire is from the sports domain. Each instance is composed of a text and two tables, where the text is a report of a basketball game and the two tables represent the scores of teams and players respectively (cf. Figure 1). Each table has column headers describing the types of scores, and row headers describing the names of teams or players. The texts are long and may contain irrelevant information such as the performance of players in other games. Thus, this is a challenging dataset for text-to-table.

Procedure
Methods: We conduct experiments with vanilla seq2seq and our method, as well as baselines.
We know of no existing method that can be directly employed in text-to-table. For each dataset, we first define the schemas based on the training data, then use an existing method of relation extraction (RE) or named entity recognition (NER) to extract information, and finally create tables based on the schemas and the extracted information. We take this as the baseline for the dataset. No single baseline can be applied to all four datasets. For RE, we use PURE, a state-of-the-art method (Zhong and Chen 2020). For NER, we use a BERT-based model (Devlin et al. 2019).
Training: For vanilla seq2seq and our method, we fine-tune the seq2seq models from BART-base, which has 12 layers, 768 hidden dimensions, 16 heads, and 139M parameters. For RE and NER, we fine-tune the models from BERT-base-uncased, which has 12 layers, 768 hidden dimensions, 12 heads, and 110M parameters.
All models are trained with the Adam optimizer until convergence, and the hyper-parameters are tuned on the development sets. Appendix A shows the hyper-parameters. For the small datasets of Rotowire and WikiTableText, we run experiments five times with different random seeds and take the average of the results to reduce variance.
Evaluation: We evaluate the performance of a method based on the number of correct non-empty cells in the tables (i.e., we ignore empty cells). To judge whether a cell is correctly generated, we use not only its content but also its row header and column header, to ensure that the cell is in the right row and the right column. Exact match is used to compare the content of a generated cell with the ground truth. We adopt precision, recall, and F1 score as evaluation measures. We calculate the measures on each generated table and then take the average over all tables. We also evaluate the percentage of output sequences that do not represent well-formulated tables, referred to as the error rate.
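A minimal sketch of this cell-level evaluation for a table with both row and column headers (tables with only one kind of header need an analogous cell key, and duplicate cells are collapsed here for simplicity):

```python
# Cell-level precision/recall/F1 sketch; assumes rectangular tables
# (cf. the post-processing step). A non-empty cell is correct only if its
# content, row header, and column header all exactly match the ground truth.
def table_prf(pred, gold):
    def cells(table):
        return {
            (table[i][0], table[0][j], table[i][j])  # (row header, col header, content)
            for i in range(1, len(table))
            for j in range(1, len(table[0]))
            if table[i][j] != ""                     # empty cells are ignored
        }
    p, g = cells(pred), cells(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```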

Results on Rotowire
We implement two baselines of RE, namely doc-level RE and sent-level RE. We take team names, player names, and score numbers as entities and take the types of scores as relations. Sent-level RE predicts the relations between entities within each sentence. Doc-level RE predicts the relations between entities within a window (the window size is 12 entities) and uses the approximation model proposed by Zhong and Chen (2020) to speed up inference.

Table 3 shows the results on the Rotowire dataset. One can see that in terms of F1 score, our method performs the best, followed by vanilla seq2seq, and both outperform the baselines of doc-level RE and sent-level RE. The RE baselines perform quite well, but they heavily rely on rules and cannot beat the seq2seq approach. Among them, doc-level RE performs better than sent-level RE, because some information in Rotowire can only be extracted when cross-sentence context is provided.

Results on E2E, WikiTableText and WikiBio
We implement the baseline of NER in the following way. We view the non-header cells in the tables as entities and their row headers as entity types. In training, we match the non-header cells to the texts and take them as "entities" in the texts. Only a proportion of the non-header cells can be matched to the texts (85% for E2E, 74% for WikiTableText, and 69% for WikiBio).

Table 4 shows the results of our method, vanilla seq2seq, and the baseline of NER on E2E, WikiTableText, and WikiBio. Again, the seq2seq approach outperforms the baseline. The NER baseline has slightly higher precision, but the seq2seq approach has significantly higher recall and F1. Our method and vanilla seq2seq are comparable, because the table structures in the three datasets are very simple (there are only two columns in the tables), and the use of the two techniques does not further improve the performance. The NER baseline has high precision but low recall, mainly because NER can only make the right decision when the case is clear-cut.
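As an illustration of the matching step described above, here is a minimal sketch for a two-column table. The example text and rows are made up; only cells whose content occurs verbatim in the text are kept, which is why only a proportion of the cells can be matched.

```python
# Matching non-header cells into the text to create NER training "entities"
# (two-column tables: each row is a (row header, value) pair, and the row
# header serves as the entity type).
def cells_to_ner_spans(text, rows):
    spans = []
    for header, value in rows:
        start = text.find(value)
        if start >= 0:                    # keep only verbatim matches
            spans.append((start, start + len(value), header))
    return spans

text = "John Smith (born 1 May 1970) is a British actor ..."
rows = [("name", "John Smith"), ("birth date", "1 May 1970")]
print(cells_to_ner_spans(text, rows))
# [(0, 10, 'name'), (17, 27, 'birth date')]
```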

Additional Study
We carry out an ablation study on our method; specifically, we exclude the pre-trained language model, table constraint (TC), and table relation embeddings (TRE) from our method. Note that our method without TC and TRE is equivalent to vanilla seq2seq. Table 5 gives the results on the four datasets.
It can be seen that the use of both TC and TRE significantly improves the performance on Rotowire, which indicates that our method is particularly effective when the tables are large, with many rows and columns. There are no significant improvements on E2E, WikiTableText, and WikiBio, apparently because the formulation of tables is easy for the three datasets. Therefore, we conclude that the two techniques of TC and TRE are helpful when the task is difficult.
The use of the pre-trained language model boosts the performance on all datasets, especially on Rotowire and WikiTableText. This indicates that a pre-trained language model is particularly helpful when the task is difficult and the size of the training data is small.
We observe that vanilla seq2seq makes more formatting errors than our method, especially on the player tables in Rotowire, which have a large number of columns. This indicates that it is difficult for vanilla seq2seq to keep track of the columns in each row and make alignments with the column headers. In contrast, the two techniques of our method can effectively cope with the problem. Figure 4 shows a bad case of vanilla seq2seq, where the model correctly infers the column of "assists" but fails to infer the columns of "personal fouls", "points", and "total rebounds" for the row of "Rajon Rondo". In contrast, our method can successfully handle the case, because TC eliminates the incorrectly formatted output, and TRE makes correct alignments with the column headers.

We also investigate the effect of the scale of the pre-trained language model BART. We use both BART-base and BART-large and conduct fine-tuning on top of them for vanilla seq2seq and our method. Table 6 gives the results on the four datasets. The results show that the use of BART-large can further boost the performances on all four datasets, indicating that it is better to use larger pre-trained models when computation cost is not an issue.

Discussions
We analyze the experimental results on the four datasets and identify five challenging issues.
(1) Text Diversity: Extraction of the same content from different expressions is one challenge. For example, the use of synonyms is very common in Rotowire. The team of "Knicks" is often referred to as "New York", its home city. Identification of the same entities from different expressions is needed in the task.
(2) Text Redundancy: There are cases, such as those in WikiBio, in which the texts contain much redundant information. This requires the text-to-table model to have a strong ability in summarization. It seems that the seq2seq approach works well to some extent, but further improvement is undoubtedly necessary.
(3) Large Table: The tables in Rotowire have large numbers of columns, and the extraction from them is challenging even for our method of using TC and TRE.
(4) Background Knowledge: WikiTableText and WikiBio are from the open domain. Thus, performing text-to-table on such datasets requires the use of much background knowledge. The use of more powerful pre-trained language models can be further explored in the future.
(5) Reasoning: Sometimes the information is not explicitly presented in the text, and reasoning is required to conduct correct extraction. For example, an article in Rotowire reports a game between the two teams "Nets" and "Wizards". From the sentence "The Nets seized control of this game from the very start, opening up a 31 - 14 lead after the first quarter", humans can infer that the "Wizards" scored 14 points in the first quarter, which is still difficult for machines.

Conclusion
We propose employing text-to-table as a new way of information extraction (IE), which extracts information of interest from the input text and summarizes the extracted information in tables. The advantage of the approach is that one can easily conduct information extraction from either short texts or long texts to create simple tables or complex tables, without explicitly defining the schemas. Text-to-table can be viewed as an inverse problem of table-to-text. We formalize text-to-table as a sequence-to-sequence problem on top of a pre-trained language model. We further propose an improved method within the seq2seq approach, using the table constraint and table relation embeddings techniques. We conduct experiments on four datasets derived from existing table-to-text datasets. The experimental results demonstrate that our proposed approach outperforms baselines using conventional IE techniques. We further analyze the challenges of text-to-table for future study. The issues include diversity of text, redundancy of text, large tables, background knowledge, and reasoning.