Joint Learning of Representations for Web-tables, Entities and Types using Graph Convolutional Network

Existing approaches for table annotation with entities and types either capture the structure of tables using graphical models, or learn embeddings of table entries without accounting for the complete syntactic structure. We propose TabGCN, which uses Graph Convolutional Networks to capture the complete structure of tables, the knowledge graph and the training annotations, and jointly learns embeddings for table elements as well as for entities and types. To account for knowledge incompleteness, TabGCN's embeddings can be used to discover new entities and types. Using experiments on 5 benchmark datasets, we show that TabGCN significantly outperforms multiple state-of-the-art baselines for table annotation, while showing promising performance on downstream table-related applications.


Introduction
Table data abounds in webpages and organizational documents. Annotating table entries, such as columns, cells and rows, using available background knowledge (e.g. Yago, DBPedia, Freebase), such as knowledge of entities and their types, helps in better understanding and semantic interpretation of such tabular data. The challenge, however, is that such web tables do not adhere to any standard format, schema or convention (Limaye et al., 2010). Additionally, knowledge graphs are typically incomplete: entities and types mentioned in tables may not always exist in the knowledge graph. Therefore, it becomes necessary to expand the knowledge graph with new entities (Zhang et al., 2020) and types for annotating tables.
Initial research on table annotation (Limaye et al., 2010; Takeoka et al., 2019; Bhagavatula et al., 2015) used probabilistic graphical models to capture the complete row-column structure of tables as well as the knowledge graph for collective annotation. More recent approaches using embeddings (Gentile et al., 2017; Zhang and Balog, 2018; Zhang et al., 2019; Chen et al., 2019; Yin et al., 2020) only partly capture the syntactic structure of tables, and also ignore the structure of the knowledge graph. The problem of incompleteness of the knowledge representation (Zhang et al., 2020) is mostly unaddressed.
In this work, we propose the TabGCN model, which uses a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) to unify the complete syntactic structure of tables (rows, columns and cells) and that of the knowledge graph (entities and types) via the available annotations. The embeddings of the table elements as well as the knowledge graph entities and types are trained jointly and end-to-end. While GCNs have been used to learn embeddings for many NLP tasks using the syntactic and semantic structure of natural language sentences (Marcheggiani and Titov, 2017; Vashishth et al., 2019), encoding tabular structure using GCNs has not been addressed before. The model and embeddings thus trained are used to annotate new tables with known entities and types, while discovering hitherto unseen entities and types. Additionally, we use the trained embeddings for tables and rows for downstream table-related tasks: identifying similar tables, and identifying the appropriate table for any row.
We demonstrate these capabilities of TabGCN using experiments on 5 benchmark web table datasets, comparing against 5 existing models. We show that TabGCN significantly improves performance for entity and type annotation. For the other tasks, we show that the same embeddings achieve impressive performance. No existing model can perform all of these tasks.
Our contributions are as follows: (a) We propose a model called TabGCN based on the GCN architecture that captures the complete syntactic structure of tables as well as the knowledge representation, and learns embeddings of tables, rows, columns and cells, as well as entities and types jointly and in an end-to-end fashion. (b) TabGCN addresses incompleteness in the knowledge representation by discovering new entities and types. (c) TabGCN significantly outperforms 5 existing approaches in 5 different benchmark datasets for the task of table annotation. (d) The trained embeddings show impressive performance in downstream tasks such as identifying similar tables and assignment of rows to appropriate tables.

Related Work
Existing literature on table annotation considers two different types of tables. In general, web tables (Limaye et al., 2010; Takeoka et al., 2019; Bhagavatula et al., 2015) contain mentions of entities under every column, which need to be annotated. In contrast, relational tables (Gentile et al., 2017; Zhang and Balog, 2018; Zhang et al., 2019, 2020) refer to a single entity in each row, with a single core or anchor column and multiple attribute columns. Here, the entire row is annotated with a single entity. While both tasks are important, our focus is on the first category. In terms of approaches, one category of work considers graphical models to capture table structure and performs joint inference for entity and type classification (Limaye et al., 2010; Takeoka et al., 2019; Bhagavatula et al., 2015). Limaye et al. (2010) use a Markov Random Field (MRF) that captures the structure of tables, and create dependencies between entity and type annotations to jointly classify entities for cells and types for columns. The MRF potentials capture domain knowledge about similarities between cells, between cells and entity lemmas, and between entities using the type hierarchy. It cannot handle entities and types not used as labels during training. MeiMei (Takeoka et al., 2019) extends this MRF framework to handle numeric columns and focuses on multi-label classifiers for entities with multiple types. Additionally, it constructs embeddings of entities and types using knowledge graph structure to avoid expensive graph traversal for computing potential function features. However, it still requires manual construction of these features based on domain knowledge. Our GCN architecture is motivated by the structure of the graphical model in these papers. Both these models also jointly label pairs of columns with known relationships, which we do not address in our work.
The second category focuses on embeddings for tables (Gentile et al., 2017; Zhang and Balog, 2018; Zhang et al., 2019). All of these models transform table data to word sequences and then make use of neural language models. Zhang and Balog (2018) make use of RDF2vec for embedding tables, which internally uses neural language models after transforming graphs to sequences. ColNet (Chen et al., 2019) considers columns as cell sequences and uses a CNN to learn representations of individual cells, which are used to predict entity types for columns. TaBERT (Yin et al., 2020) focuses on retrieving table rows for natural language utterances and learns a joint representation for utterances and cell sequences in table rows using BERT. In summary, these models fail to capture the complete row-column structure of tables in the embeddings. In contrast, our GCN architecture captures the structure of all tables, the entities, the types and the training annotations.
For the problem of extending the knowledge graph from tables, Zhang et al. (2020) consider discovery of new entities, but only in the context of relational tables with a single core column. Their approach makes use of pre-trained neural embeddings (word2vec) and cosine similarity in addition to lexical similarity. Our focus is on web tables, where we need to discover new entities and types for all columns in a table. Our approach makes use of GCN embeddings trained using entity and type labels to detect new entities and types.

Problem: Table Annotation
In this section, we first define tables, the background knowledge of entities and types, annotation of tables with entities and types, and, finally, the problem of semi-supervised table annotation with incomplete knowledge.
Tables: We are given a set of tables $S$. Fig. 1 shows an example at the top. The $k$-th table $S^k \in S$ consists of $m_k$ rows and $n_k$ columns of individual cells. The cell in the $i$-th row and $j$-th column of the $k$-th table is denoted $x^k_{ij}$. We consider textual tables only, so that each $x^k_{ij}$ takes a string value. We denote the $i$-th row by $R^k_i$ and the $j$-th column by $C^k_j$.
Entities and Types: We assume background knowledge of entities and entity types (or simply, types). Let $T$ denote the set of types, and $E$ the set of entities. Each entity $E$ is associated with a type $T(E)$. For each entity $E$, there is also an entity description or lemma $L(E)$. In Fig. 1, the oval nodes show example entities and types. Entity E123 has associated type T12:Person and lemma Tamara K..

Table Annotations: We assume that tables contain information about the entities $E$ and types $T$. Specifically, each column of each table corresponds to a single type, and each cell corresponds to a specific entity of that type. In our example table, cell $x_{11}$ is annotated with entity E123, and column $C_1$ is annotated with type T12:Person. Let $T(C^k_j)$ denote the type associated with column $C^k_j$, and $E(x^k_{ij})$ the entity associated with the cell $x^k_{ij}$. Let $A_e$ be the set of all entity annotations of cells, and $A_t$ that of all type annotations of columns.

Table Annotation with Incomplete Knowledge: In the semi-supervised table annotation task, we are given the entire set of tables $S$, but only a subset $A^o_e \subset A_e$ of the entity annotations and a subset $A^o_t \subset A_t$ of the type annotations are observed. The task is to annotate the unannotated cells and columns of the tables, using the observed annotations as training data.
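The notation above can be sketched as plain data structures. All names here are our own illustrative choices, not from any released code:

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    cells: list  # cells[i][j] is the string value of cell x^k_{ij}

    @property
    def num_rows(self):
        return len(self.cells)

    @property
    def num_cols(self):
        return len(self.cells[0]) if self.cells else 0

@dataclass
class Knowledge:
    entity_type: dict   # entity E -> type T(E)
    entity_lemma: dict  # entity E -> lemma L(E)

@dataclass
class Annotations:
    cell_entity: dict = field(default_factory=dict)  # (k, i, j) -> entity id
    col_type: dict = field(default_factory=dict)     # (k, j) -> type id

# The running example: a table, one entity annotation, one type annotation.
S1 = Table(cells=[["Tamara K.", "USA"], ["J. Smith", "UK"]])
kb = Knowledge(entity_type={"E123": "T12:Person"},
               entity_lemma={"E123": "Tamara K."})
ann = Annotations(cell_entity={(0, 0, 0): "E123"},
                  col_type={(0, 0): "T12:Person"})
```

In the semi-supervised setting, only a subset of `cell_entity` and `col_type` would be observed at training time.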

Semi-supervised Annotation Tasks
Let $T^o$ denote the set of unique types seen in $A^o_t$, and $E^o$ the set of unique entities seen in $A^o_e$. In the incomplete knowledge setting, $T^o \subset T$, indicating that not all types are seen in the training annotations. Similarly, not all entities are seen in training: $E^o \subset E$. Now, the task for the unannotated cells and columns is three-fold. The first is to decide whether these correspond to observed entities $E^o$ and observed types $T^o$. We call this novelty classification. Next, the non-novel table columns need to be annotated with observed types $T^o$, and the non-novel table cells with observed entities $E^o$. We call these type detection and entity detection respectively. Finally, the columns corresponding to novel types need to be grouped according to distinct novel types, and the cells corresponding to novel entities need to be grouped according to distinct novel entities. We call these type discovery and entity discovery respectively.

Joint Table and Knowledge Embedding
We first present the network architecture of our model, which we call TabGCN, at a high level. The core components of TabGCN are: (I) a Graph Convolutional Network (GCN), which captures the various syntactic relationships between table and knowledge elements and jointly learns embeddings for these via the convolution operation. The GCN embeddings of table elements contain information about both types and entities. These are fed into two parallel components: (II) the Type Classification component, and (III) the Entity Classification component. The Type Classification component first projects the GCN embedding of a table column to a type space using a type-projection matrix, and then uses a soft-max layer to classify this type embedding according to observed types. Similarly, the Entity Classification component first projects the GCN embedding of a table cell to an entity space using an entity-projection matrix, and then uses a soft-max layer to classify this entity embedding according to observed entities. Fig. 2 shows the high-level architecture of our model for a subgraph of our example table graph in Fig. 1. The parameters of all three components are trained jointly in an end-to-end fashion, via back-propagation, using training entity annotations for cells and type annotations for columns. We next describe these components in greater detail.
Graph Convolutional Network: Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017; Gilmer et al., 2017) extend the notion of convolution to graphs. Let $G = (V, R)$ be a directed graph where $V$ is the set of $n$ nodes and $R$ the set of directed edges. An edge between nodes $u, v \in V$ with label $L_{uv}$ is denoted $(u, v, L_{uv}) \in R$. The edge set includes an inverse edge for each edge and a self-loop for each node. An input feature matrix $X \in \mathbb{R}^{m \times n}$ contains the input feature representation $x_u \in \mathbb{R}^m$ of each node $u \in V$ in its columns. The output embedding of a node $v$ at the $k$-th GCN layer is given by $h^{(k+1)}_v = f\big(\sum_{(u,v,L_{uv}) \in R} W^{(k)}_{L_{uv}} h^{(k)}_u\big)$, where the $W^{(k)}_{L_{uv}}$ are label-specific model parameters at the $k$-th layer and $h^{(1)}_u = x_u$. For classification, a linear classification layer is added on top of the final GCN layer. The function $f()$ is a non-linear activation, for which we use ReLU. Recall that edges in a GCN serve to bring the embeddings of their end nodes closer, the extent being determined by their weight. With this intuition, we create the edges $R$ of different types reflecting the underlying semantics of tables and the annotations. These are table edges $R_t$, knowledge edges $R_k$, annotation edges $R_a$, and lexical similarity edges $R_l$. Table edges capture the semantics of web tables, which do not have special anchor columns. These are of four categories: a cell-column edge between a cell node $x^k_{ij}$ and its corresponding column node $C^k_j$; a cell-row edge between a cell node $x^k_{ij}$ and its corresponding row node $R^k_i$; a column-table edge between a column node $C^k_j$ and its corresponding table node $S^k$; and a row-table edge between a row node $R^k_i$ and its corresponding table node $S^k$. Knowledge edges connect each entity node $E^o$ with its corresponding type node $T(E^o)$.
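A single label-conditioned GCN layer of the kind described above can be sketched as a numpy forward pass. This is a simplified illustration only: the mean aggregation is our own choice, and the paper's exact normalization and training code are not reproduced here.

```python
import numpy as np

def gcn_layer(H, edges, weights, relu=True):
    """One label-conditioned GCN layer (forward pass only).

    H:       (n, d_in) array; row u is the input embedding h_u.
    edges:   list of (u, v, label) triples; inverse edges and self-loops
             are assumed to already be included, as in the text.
    weights: dict mapping edge label -> (d_in, d_out) matrix W_label.
    """
    n = H.shape[0]
    d_out = next(iter(weights.values())).shape[1]
    out = np.zeros((n, d_out))
    deg = np.zeros(n)
    for u, v, lab in edges:
        out[v] += H[u] @ weights[lab]  # message from u to v through W_label
        deg[v] += 1
    out /= np.maximum(deg, 1)[:, None]  # mean over incoming messages (our choice)
    return np.maximum(out, 0.0) if relu else out
```

Stacking two such layers with shared per-label weights per layer gives the $K$-layer encoder described in the text.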

Annotation edges are of two categories: an entity annotation edge for each entity annotation in A o e between a cell node x k ij and its labeled entity node E(x k ij ); and a type annotation edge for each type annotation in A o t between a column node C k j and its labeled type node T (C k j ). Lexical similarity edges are added between pairs of cells in the same or different tables whose lexical similarity, computed using character-based Jaccard (Limaye et al., 2010), is above a threshold.
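The lexical similarity edges can be illustrated with character n-gram Jaccard similarity. The n-gram size and threshold below are our own illustrative values, not those used in the paper:

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b, n=3):
    """Character-based Jaccard similarity between two strings."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

def lexical_edges(cells, threshold=0.4):
    """cells: list of (node_id, text). Returns undirected edges between
    cell pairs whose similarity exceeds the threshold."""
    edges = []
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if jaccard(cells[i][1], cells[j][1]) >= threshold:
                edges.append((cells[i][0], cells[j][0]))
    return edges
```

For large corpora one would block or index cells first rather than compare all pairs; the quadratic loop here is only for clarity.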
All edges are bi-directional. Each of the 8 edge categories above has its own parameters ($W_l$) for each layer of our GCN. Self-loops are added for nodes associated with textual input, specifically cells, and entities with lemmas. For the input representation of such nodes, we use pre-trained word embeddings for each of their constituent tokens and take their mean. In this paper we used GloVe (Pennington et al., 2014).
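The mean-pooled input representation can be sketched as follows. The embedding dictionary and dimensionality are placeholders; the paper uses pre-trained GloVe vectors:

```python
import numpy as np

def node_input(text, word_vecs, dim=50):
    """Mean of per-token pre-trained vectors for a textual node.

    word_vecs: token -> vector dict (e.g. loaded GloVe vectors);
    out-of-vocabulary tokens fall back to zero vectors (our choice).
    """
    toks = text.lower().split()
    vecs = [word_vecs.get(t, np.zeros(dim)) for t in toks]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```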
Type Classification: The final ($K$-th) GCN layer generates an embedding $h^{(K)}_v$ for each node $v$, which contains information about both types and entities. For type classification of a column node $c$, we first get its type embedding $h_t(c)$ by projecting $h^{(K)}_c$ to a type space using a type projection matrix $P_t$: $h_t(c) = P_t h^{(K)}_c$. We then obtain the probability distribution $g_t(c)$ over known types $T^o$ from the type embedding $h_t(c)$ using a soft-max layer with weight matrix $\theta_t$: $g_t(c) = \sigma(h_t(c); \theta_t)$. The type projection matrix $P_t$ and the soft-max weight matrix $\theta_t$ form the parameters of this component.

Entity Classification:
We follow a similar approach for entity classification of a cell node $x$. We first project its GCN embedding $h^{(K)}_x$ to an entity space using an entity projection matrix $P_e$: $h_e(x) = P_e h^{(K)}_x$. The entity embedding $h_e(x)$ is passed through a soft-max layer with weight matrix $\theta_e$ to get the probability distribution $g_e(x)$ over known entities $E^o$. The entity projection matrix $P_e$ and the soft-max weight matrix $\theta_e$ form the parameters of this component.

Joint Training:
The parameters of all three components are trained end-to-end using the available entity annotations for cells and type annotations for columns. Specifically, we take the type predictions $g_t(c)$ for columns and minimize the classification loss against the observed type labels $T(c)$. We similarly minimize the classification loss between the entity predictions $g_e(x)$ and the observed entity labels $E(x)$ for cells. We use a weighted combination of the entity prediction loss and the type prediction loss, with cross-entropy as the classification loss function and Adam (Kingma and Ba, 2015) for optimization.
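The joint objective can be sketched as a weighted sum of two soft-max cross-entropy losses. The mixing weight `alpha` and all matrix shapes below are illustrative assumptions; the paper's actual weighting is not reproduced here:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the gold labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def joint_loss(H_cols, H_cells, P_t, theta_t, P_e, theta_e,
               type_labels, entity_labels, alpha=0.5):
    """Weighted combination of type and entity classification losses.
    alpha is a hyper-parameter (an assumption; not the paper's value)."""
    g_t = softmax((H_cols @ P_t) @ theta_t)    # g_t(c) = sigma(P_t h(c); theta_t)
    g_e = softmax((H_cells @ P_e) @ theta_e)   # g_e(x) = sigma(P_e h(x); theta_e)
    return alpha * cross_entropy(g_t, type_labels) \
        + (1 - alpha) * cross_entropy(g_e, entity_labels)
```

In practice the projections and soft-max layers would be trained by back-propagation through this loss, with Adam as in the text.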

Annotating Tables using Embeddings
Training the network of Sec. 4 generates estimates of the model parameters as well as embeddings for all table and knowledge elements. In this section, we describe the use of these parameters and embeddings for the tasks defined in Sec. 3. For brevity, we write $h(v)$ for the GCN embedding $h^{(K)}_v$ of a node $v$.
Novelty Classification: To decide whether an unannotated column $c$ corresponds to any of the known types in $T^o$ (novel type classification), we make use of the type embeddings. Column $c$ corresponds to a new type if its type-space embedding $h_t(c) = P_t h(c)$ is 'far away' from the type-space embedding $h_t(T) = P_t h(T)$ of every $T \in T^o$. More specifically, we use $\delta(\max_{T \in T^o} \cos(P_t h(c), P_t h(T)) \le \epsilon_t)$, where $\epsilon_t$ is the novel type threshold and $\delta()$ is the Kronecker delta function. A similar approach is followed for deciding whether an unannotated cell $x$ corresponds to any of the known entities in $E^o$ (novel entity classification), using the entity embeddings. Specifically, we use $\delta(\max_{E \in E^o} \cos(P_e h(x), P_e h(E)) \le \epsilon_e)$, where $\epsilon_e$ is the novel entity threshold.
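The cosine-threshold novelty test can be sketched directly from the formula above; the projection matrix and threshold here are placeholders:

```python
import numpy as np

def is_novel(h_node, proj, known_embs, threshold):
    """True iff the projected node embedding is far (in cosine similarity)
    from every known type/entity embedding, i.e. max similarity <= threshold."""
    v = proj @ h_node
    sims = []
    for h in known_embs:
        u = proj @ h
        sims.append(v @ u / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12))
    return bool(max(sims) <= threshold)
```

The same function covers both cases: pass $P_t$ with known type embeddings and $\epsilon_t$, or $P_e$ with known entity embeddings and $\epsilon_e$.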
Type and Entity Detection: Columns and cells determined to be non-novel need to be classified according to known types and entities respectively. This is done using forward propagation through the trained network on the embeddings of the corresponding nodes. The type prediction $g_t(c)$ of a column $c$ is obtained as $g_t(c) = \sigma(P_t h(c); \theta_t)$. Similarly, the entity prediction $g_e(x)$ of a cell $x$ is obtained as $g_e(x) = \sigma(P_e h(x); \theta_e)$.
Type and Entity Discovery: On the other hand, columns and cells determined to be novel need to be grouped according to distinct new types and entities respectively. This is done by clustering their projections in the appropriate space. Specifically, for type discovery, we take the type embeddings $h_t(c)$ of all novel columns $c$ and cluster these. Similarly, for entity discovery, we cluster the entity embeddings $h_e(x)$ of all novel cells $x$. The clustering algorithm needs to automatically determine the number of clusters in both cases. In this paper, we use clustering with the Silhouette criterion (Rousseeuw, 1987) as a representative non-parametric approach. Other approaches, such as Bayesian non-parametric techniques (Teh, 2010), may also be used here.
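A silhouette-based discovery step might look as follows. This is a minimal sketch that pairs k-means with the silhouette criterion (Rousseeuw, 1987) to choose the number of clusters, which is one common way to realize this kind of cluster-count selection; the paper does not specify this exact recipe.

```python
import numpy as np

def _kmeans(X, k, iters=20):
    # Farthest-point initialization followed by Lloyd's iterations.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def _silhouette(X, labels):
    # Mean silhouette coefficient over all points.
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    scores = []
    for i in range(n):
        own = labels[i]
        same = (labels == own) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def discover(X, k_max=6):
    """Cluster novel embeddings X, picking k with the best silhouette score."""
    best = (-1.0, None, None)
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = _kmeans(X, k)
        best = max(best, (_silhouette(X, labels), k, labels), key=lambda t: t[0])
    return best[1], best[2]  # chosen number of clusters, cluster labels
```

The same routine serves both type discovery (on $h_t(c)$ of novel columns) and entity discovery (on $h_e(x)$ of novel cells).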

Down-stream Inference for Rows and Tables:
Training annotations are only provided for cells and columns, but after training, embeddings are available for rows and tables as well, and these can be used for various down-stream application tasks. Since these tasks do not involve the type or entity spaces, we directly use the GCN embeddings. As examples, we define two such tasks here. Table clustering is the task of grouping together semantically related tables. For this, we cluster the table embeddings $h(S)$ of all tables $S \in S$. For consistency, we again use clustering with the Silhouette criterion. Row-to-table assignment is the task of assigning a row to its most appropriate table. For this, we assign a row $R$ with embedding $h(R)$ to the table $S^k$ with the 'closest' embedding $h(S^k)$, or, more specifically, to $S^* = \arg\max_{S^k} \cos(h(R), h(S^k))$. If $R$ is a row from a table provided during training, then its embedding is readily available. If it is a new row, then its embedding is created by convolving over the input embeddings of its constituent cells using the trained parameters $W_l$ for cell-row edges.
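Row-to-table assignment reduces to an argmax over cosine similarities between learned embeddings; the embeddings below are placeholders:

```python
import numpy as np

def assign_row(h_row, table_embs):
    """Index of the table whose embedding maximizes cos(h(R), h(S_k))."""
    sims = [h_row @ h / (np.linalg.norm(h_row) * np.linalg.norm(h) + 1e-12)
            for h in table_embs]
    return int(np.argmax(sims))
```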

Experiments
In this section, we first present experimental results for table annotation, and then for down-stream table-related tasks. We compare our proposed model TabGCN with appropriate state-of-the-art baselines. We have uploaded our source code as supplementary material for reproducibility.

Datasets: The Wikipedia dataset is from Efthymiou et al. (2017), and has HTML tables from Wikipedia 2013 with class attribute "wikitable". The last three datasets are annotated using DBPedia, but T2Dv2 and Wikipedia contain only type annotations; entity annotations are not available. For all datasets, we first set aside 30% of the unique entities and types, and effectively all their annotations, as unseen during training. Of the remaining annotations, a further 30% was removed during training.
Baselines: We compare TabGCN against five existing approaches. PGM (Limaye et al., 2010) uses a probabilistic graphical model and hand-crafted features. Since our entity-type set T is flat instead of being a hierarchical graph, we use logistic regression instead of structural SVM to estimate model parameters. MeiMei (Takeoka et al., 2019) also uses a Markov Random Field but embeds the entities and types for fast computation of clique potentials. For both models, we ignore the relation labels for column-pairs and the associated potential functions, since we do not address relation detection. Table2Vec (Zhang et al., 2019) learns various word2vec-based embeddings of tables for entity annotation. It does not address type annotation. It focuses on relational tables and associates a single entity with each row via its core column. We adapted this model for web tables, which associate an entity with each cell. We use their Table2VecE version, which models a row as a sequence of all entities that appear within cells of that row. After training the word2vec model, instead of considering the embedding for the entire row as in Table2VecE, we create cell-specific embeddings from only the tokens of that cell. ColNet (Chen et al., 2019) models each column as a sequence of words from the cells under the column and uses a CNN to predict the column type. We also use an adaptation of TaBERT (Yin et al., 2020), which trains a joint language model for retrieval from tables given utterances. We adapt their approach to independently linearize each row and column of a table as a sequence of cells under that row and column, and obtain column and row embeddings as the mean of the corresponding cell embeddings. For cells, we use pre-trained BERT embeddings.
Hyper-parameters: We used one-hot encodings as inputs for cells and entities with lemmas. We used ReLU as the non-linear GCN activation, and 2 GCN layers with 512 and 256 dimensions. As a result, the 8 GCN weight matrices were $V \times 512$ in the first layer, where $V$ is the vocabulary size, and $512 \times 256$ in the second.

Detection Results: We first present results for entity and type detection, which most of the baselines address. For this task, all models predict entity labels for all cells and type labels for all columns that are unannotated. However, evaluation is only over those cells and columns whose true entity and type labels are contained in the observed entities $E^o$ and observed types $T^o$ respectively. For evaluation, as in earlier table annotation papers (Limaye et al., 2010; Chen et al., 2019), we use micro-averaged F1, which takes the weighted average of the F1 scores over the entities or types. This takes class imbalance into account and is therefore more appropriate than accuracy. We note that MeiMei addresses the multi-label setting with multiple possible type labels, and therefore uses ranking evaluation measures (e.g. NDCG). These are not meaningful in our single-labeled setting.
Tab. 1 shows detailed results for all models across datasets. We can see that TabGCN significantly outperforms all baselines on all datasets for both detection tasks. The only exception is Wiki Manual, where the graphical model based approaches with hand-crafted potential functions outperform the representation learning approaches, possibly on account of the small size of the dataset. Even there, TabGCN performs best among the embedding based approaches.
Novelty Classification Results: In our second experiment, we consider novelty classification. This is an unsupervised task, where a model makes a binary decision for each unannotated column (novel type classification) and for each unannotated cell (novel entity classification). Since the decision depends on the thresholds for types ($\epsilon_t$) and entities ($\epsilon_e$), we plot the F1 score on the y-axis against the corresponding threshold on the x-axis. Of the baselines, only PGM can address this task, by outputting a NONE label for the type or entity. Fig. 3 shows the results.

Discovery Results: For type and entity discovery, we evaluate clustering quality using Normalized Mutual Information, $\mathrm{NMI} = \frac{2\, I(Y; C)}{H(Y) + H(C)}$, between the assigned cluster labels ($Y$) and the true entity or type labels ($C$), where $I(\cdot,\cdot)$ denotes mutual information and $H(\cdot)$ denotes entropy. In Tab. 2, we see that TabGCN performs consistently above 80% for entity and type discovery across datasets, significantly outperforming ColNet and TaBERT.
Ablation Study: Next, we analyze the performance of TabGCN using multiple ablations of the full model. -K leaves out the knowledge nodes and their incident edges from the GCN graph during training. -E focuses only on types by removing all entity nodes and entity-related edges (type-entity and cell-entity annotation) from the GCN graph; it is trained using only the type loss. Note that it cannot perform the tasks associated with entities, specifically entity detection, novel entity classification and novel entity discovery. Similarly, -T focuses only on entities by removing all type nodes and type-related edges (type-entity and column-type annotation) from the GCN graph; it is trained using only the entity loss, and cannot perform the tasks associated with types. Finally, -L removes the lexical similarity edges from the GCN graph. The results are recorded in Tab. 5. We can see that all components of the architecture contribute to the performance improvements. The improvement is statistically significant (using the Wilson Interval with α = 0.05) for all ablations other than -K. While -K performs comparably here, its performance drops significantly for novelty classification, as can be seen in Fig. 3.
Downstream Task Results: We finally consider row-to-table assignment. In this task, one randomly selected row is removed from every table during training. These rows then need to be classified according to their parent table.
Since the models output a ranking of tables for each row, we evaluate using two ranking-related measures. Hits@1 measures the fraction of rows with the correct table at rank 1. Mean Reciprocal Rank (MRR) is the mean of the reciprocal rank of the correct table over all rows; its perfect value is 1.0. In Tab. 3, we again see that TabGCN performs best across datasets.
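The two measures can be computed from the per-row table rankings as follows; the input format is our own assumption:

```python
def hits_at_1(rankings, gold):
    """Fraction of rows whose top-ranked table is the correct one.
    rankings[i] is a list of table ids ranked for row i; gold[i] is its true table."""
    return sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)

def mrr(rankings, gold):
    """Mean reciprocal rank of the correct table over all rows (perfect value 1.0)."""
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)
```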
In summary, we have demonstrated the usefulness of learning embeddings of table elements and knowledge elements jointly using both entity and type losses in an end-to-end fashion for type and entity annotation on 5 benchmark datasets. In addition, we have demonstrated how the learned embeddings can be useful for downstream table-related tasks. In all cases, TabGCN has significantly outperformed multiple state-of-the-art baselines using probabilistic graphical models as well as other neural approaches.

Conclusion
We have proposed TabGCN, a model that jointly learns representations of tables, rows, columns and cells, as well as entities and types, by capturing the complete syntactic structure of all tables, the relevant entities and types, and the available annotations using a Graph Convolutional Network. As a result, TabGCN unifies the benefits of probabilistic graphical model based approaches and embedding based approaches for table annotation. Using these embeddings, TabGCN significantly outperforms existing approaches for table annotation, as well as for entity and type discovery.