DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles

A crucial component in the curation of knowledge bases (KBs) for a scientific domain (e.g., materials science, food & nutrition, fuels) is information extraction from tables in the domain's published research articles. To facilitate research in this direction, we define a novel NLP task of extracting compositions of materials (e.g., glasses) from tables in materials science papers. The task involves solving several challenges in concert: tables that mention compositions have highly varying structures; text in captions and the full paper needs to be incorporated along with data in tables; and regular languages for numbers, chemical compounds, and composition expressions must be integrated into the model. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present DiSCoMaT, a strong baseline that combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DiSCoMaT outperforms recent table processing architectures by significant margins. We release our code and data for further research on this challenging IE task from scientific tables.


Introduction
Advanced knowledge of a science or engineering domain is typically found in domain-specific research papers. Information extraction (IE) from scientific articles develops ML methods to automatically extract this knowledge for curating large-scale domain-specific KBs (e.g., Ernst et al., 2015; Hope et al., 2021). These KBs have a variety of uses: they lead to ease of information access by domain researchers (Tsatsaronis et al., 2015; Hamon et al., 2017), provide data for developing domain-specific ML models (Nadkarni et al., 2021), and potentially help in accelerating scientific discoveries (Jain et al., 2013; Venugopal et al., 2021).
Significant research exists on IE from text of research papers (see Nasar et al. (2018) for a survey), but less attention is given to IE (often, numeric) from tables. Tables may report the performance of algorithms on a dataset, quantitative results of clinical trials, or other important information. Of special interest to us are tables that mention the composition and properties of an entity. Such tables are ubiquitous in various fields such as food and nutrition (tables of food items with nutritional values, see Tables 1-4 in de Holanda Cavalcanti et al. (2021) and Table 2 in Stokvis et al. (2021)), fuels (constituents and calorific values, see Table 2 in Kar et al. (2022) and Beliavskii et al. (2022)), building construction (components and costs, see Table 4 in Aggarwal and Saha (2022)), materials (constituents and properties, see Tables 1 and 2 in Kasimuthumaniyan et al. (2020) and Table 4 in Keshri et al. (2022)), medicine (compounds with weights in drugs, see Table 1 in Kalegari et al. (2014)), and more.
In materials science (MatSci) articles, the details on synthesis and characterization are reported in the text (Mysore et al., 2019), while material compositions are mostly reported in tables (Jensen et al., 2019b). A preliminary analysis of MatSci papers reveals that ∼85% of material compositions and their associated properties (e.g., density, stiffness) are reported in tables and not text (estimated by randomly choosing 100 compositions from a MatSci database and checking where they are reported). Thus, IE from tables is essential for a comprehensive understanding of a given paper, and for increasing the coverage of resulting KBs. To this end, we define a novel NLP task of extraction of materials (via IDs mentioned in the paper), constituents, and their relative percentages. For instance, Fig. 1a should output four materials A1-A4, where ID A1 is associated with three constituents (MoO₃, Fe₂O₃, and P₂O₅) and their respective percentages, 5, 38, and 57. A model for this task necessitates solving several challenges, which are discussed in detail in Sec. 3. While many of these issues have been investigated separately, e.g., numerical IE (Madaan et al., 2016), unit extraction (Sarawagi and Chakrabarti, 2014), chemical compound identification (Weston et al., 2019), and NLP for tables (Jensen et al., 2019b; Swain and Cole, 2016a), solving all these in concert creates a challenging testbed for the NLP community.
arXiv:2207.01079v3 [cs.CL] 24 Jun 2023
Here, we harvest a distantly supervised training dataset of 4,408 tables and 38,799 composition-constituent tuples by aligning a MatSci database with tables in papers. We also label 1,475 tables manually for dev and test sets. We build a baseline system DISCOMAT, which uses a pipeline of a domain-specific language model (Gupta et al., 2022), and two graph neural networks (GNNs), along with several hand-coded features and constraints. We evaluate our system on accuracy metrics for various subtasks, including material ID prediction, tuple-level predictions, and material-level complete predictions. We find that DISCOMAT's GNN architecture obtains a 7-15 point increase in accuracy numbers, compared to table processors (Herzig et al., 2020; Yin et al., 2020), which linearize the table for IE. Subsequent analysis reveals common sources of DISCOMAT errors, which will inform future research. We release all our data and code (footnote 2) for further research on this challenging task.

Related work
Recent works have developed neural models for various NLP tasks based on tabular data, viz., tabular natural language inference (Orihuela et al., 2021; Minhas et al., 2022), QA over one or a corpus of tables (Herzig et al., 2020; Yin et al., 2020; Arik and Pfister, 2021; Glass et al., 2021; Pan et al., 2021; Chemmengath et al., 2021), table orientation classification (Habibi et al., 2020; Nishida et al., 2017), and relation extraction from tables (Govindaraju et al., 2013; Macdonald and Barbosa, 2020). Several recent papers study QA models; they all linearize a table and pass it to a pre-trained language model. For example, TAPAS (Herzig et al., 2020) does this for Wikipedia tables to answer natural language questions by selecting table cells and aggregation operators. TABERT (Yin et al., 2020) and RCI (Glass et al., 2021) also use similar ideas alongside some architectural modifications to handle rows and columns better. TABBIE (Iida et al., 2021) consists of two transformers that encode rows and columns independently, whereas TAPEX uses an encoder-decoder architecture based on BART. TABBIE and TAPEX also introduce pretraining over tables to learn table representations better. Similar to our work, tables have also been modeled as graphs for sequential question answering over tables (Müller et al., 2019). However, all these works generally assume a fixed and known structure of tables with the same orientation, with the top row being the header row in all cases, an assumption violated in our setting.
Orientation and semantic structure classification: DeepTable (Habibi et al., 2020) is a permutation-invariant neural model, which classifies tables into three orientations, while TabNet (Nishida et al., 2017) uses RNNs and CNNs in a hybrid fashion to classify web tables into five different types of orientations. INFOTABS (Gupta et al., 2020) studies natural language inference on tabular data via linearization and language models, which has been extended to the multilingual setting (Minhas et al., 2022), and has been combined with knowledge graphs (Varun et al., 2022). Some earlier works also focused on annotating column types, entity ID cells, and pairs of columns with binary relations, based on rule-based and other ML approaches, given a catalog (Limaye et al., 2010).
2 https://github.com/M3RG-IITD/DiSCoMaT

Challenges in composition extraction from tables
We analyze numerous composition tables in MatSci research papers (see Figures 1, 4, and 6 for examples), and find that the task has several facets, with many table styles for similar compositions. We now describe the key challenges involved in the task of composition extraction from tables.
• Distractor rows and columns: Additional information such as material properties, molar ratios, and standard errors may appear in the same table. E.g., in Figure 1a, the last three rows are distractor rows.
• Orientation of tables: Compositions may be arranged along rows or along columns. Fig. 1a is a column-oriented table.
• Different units: Compositions can be in different units such as mol%, weight%, mol fraction, or weight fraction. Some tables express composition in both molar and mass units.
• Material IDs: Authors refer to different materials in their publication by assigning them unique IDs. These material IDs may not be specified every time (e.g., Fig. 1c).
• Single-cell compositions (SCC): In Fig. 1a, all compositions are present in multiple table cells. Some authors report the entire composition in a single table cell, as shown in Fig. 1c.
• Percentages exceeding 100: The sum of coefficients may exceed 100, and re-normalization is needed. A common case is when a dopant is used; its amount is reported in excess.
• Percentages as variables: Contributions of constituents may be expressed using variables like x, y. In Fig. 6 (see App. A), x represents the mol% of (GeBr₄) and the 2nd row contains its value.
• Partial-information tables: It is also common to have percentages of only some constituents in the table; the remaining composition is to be inferred based on paper text or table caption, e.g., Figure 1b. Another example: if the paper is on silicate glasses, then SiO₂ is assumed.
• Other corner cases: There are several other corner cases like percentages missing from the table, compounds with variables (e.g., R₂O in the header; the value of R to be inferred from material ID), and highly unusual placement of information (some examples in appendix).

Problem formulation
Our goal is automated extraction of material compositions from tables. Formally, given a table $T$, its caption, and the complete text of the publication in which $T$ occurs, we aim to extract compositions expressed in $T$, in the form $\{(id, c^{id}_k, p^{id}_k, u^{id}_k)\}_{k=1}^{K^{id}}$. Here, $id$ represents the material ID, as used in the paper. Material IDs are defined by MatSci researchers to succinctly refer to that composition in text and other tables. $c^{id}_k$ is a constituent element or compound present in the material, $K^{id}$ is the total number of constituents in the material, $p^{id}_k > 0$ denotes the percentage contribution of $c^{id}_k$ in its composition, and $u^{id}_k$ is the unit of $p^{id}_k$ (either mole% or weight%). For instance, the desired output tuples corresponding to ID A1 from Figure 1a are (A1, MoO₃, 5, mol%), (A1, Fe₂O₃, 38, mol%), (A1, P₂O₅, 57, mol%).
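The output format can be illustrated with a small Python sketch. The class name `CompositionTuple` and its field names are our own illustrative choices, not identifiers from the released code; only the A1 values come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompositionTuple:
    """One (id, constituent, percentage, unit) tuple; names are illustrative."""
    material_id: str   # material ID as used in the paper, e.g. "A1"
    constituent: str   # constituent element or compound
    percentage: float  # percentage contribution, must be positive
    unit: str          # unit of the percentage, e.g. "mol%"

    def __post_init__(self):
        # The formulation requires a strictly positive percentage.
        if self.percentage <= 0:
            raise ValueError("percentage must be positive")


# Desired output for material A1 of Figure 1a, as stated in the text.
a1 = [
    CompositionTuple("A1", "MoO3", 5, "mol%"),
    CompositionTuple("A1", "Fe2O3", 38, "mol%"),
    CompositionTuple("A1", "P2O5", 57, "mol%"),
]
total = sum(t.percentage for t in a1)
```

Making the tuple frozen (hashable) is a design convenience: sets of such tuples can be compared directly when scoring predictions against gold annotations.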

Dataset construction
We match a MatSci DB of materials and compositions with tables from published papers, to automatically provide distantly-supervised labels for extraction. We first use a commercial DB (NGF, 2019) of glass compositions with the respective references. Then, we extract all tables from the 2,536 references in the DB using a text-mining API (els). We use a table parser (Jensen et al., 2019a) for raw XML tables and captions. This results in 5,883 tables, of which 2,355 express compositions with 16,729 materials and 58,481 (material ID, constituent, composition percentage, unit) tuples. We keep tables from 1,880 papers for training, and the rest are split into dev and test (see Table 4b).
The DB does not contain information about the location of a given composition in the paper: it may be in text, images, graphs, or tables. If present in a table, it can appear in any column or row. Since we do not know the exact location of a composition, we use distantly supervised train set construction (Mintz et al., 2009). First, we simply match the chemical compounds and percentages (or equivalent fractions) mentioned in the DB with the text in a table from the associated paper. If all composition percentages are found in multiple cells of the table, it is marked as MCC-CI (multi-cell composition with complete information). However, due to several problems (see Appendix 3), this simple matching misses many composition tables. To increase the coverage, we additionally use a rule-based composition parser (described below), but restricted to only those compounds (CPD non-terminal in Figure 2) that appear in the DB for this paper.
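A minimal sketch of the first matching step is given below. It is an illustration under our own assumptions, not the released implementation: the function name, the substring test for compounds, and the numeric tolerance are all hypothetical choices.

```python
def matches_mcc_ci(db_composition, table, tol=1e-6):
    """Return True if every DB compound and its percentage (or the
    equivalent fraction) can be found in the cells of the table.

    db_composition: dict mapping compound name -> percentage (assumed format).
    table: list of rows, each a list of cell strings.
    """
    cells = [cell.strip() for row in table for cell in row]
    numbers = []
    for cell in cells:
        try:
            numbers.append(float(cell))
        except ValueError:
            pass  # non-numeric cell, e.g. a header or a material ID
    for compound, pct in db_composition.items():
        if not any(compound in cell for cell in cells):
            return False
        # Accept the percentage itself or the equivalent fraction.
        if not any(abs(n - pct) <= tol or abs(100 * n - pct) <= tol
                   for n in numbers):
            return False
    return True


db_entry = {"MoO3": 5, "Fe2O3": 38, "P2O5": 57}
table = [["ID", "MoO3", "Fe2O3", "P2O5"],
         ["A1", "5", "38", "57"]]
found = matches_mcc_ci(db_entry, table)
```

This check ignores where in the table the numbers occur, which is one reason such matching is noisy and why the dev and test sets are annotated by hand.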
Our distant supervision approach obtains table-level annotation (NC, SCC, MCC-PI, MCC-CI), where a table is labeled as non-composition, single/multi cell composition with partial/complete information. It also obtains annotation for each row or column into four labels: ID, composition, constituent, and other. While training data is created using distant supervision, dev and test sets are hand annotated. We now explain the dataset construction process in further detail.
Rule-based composition parser: The parser helps find names of constituents from MCC tables, and also matches full compositions mentioned in SCC tables. Recall that in SCC tables, the full composition expression is written in a single cell in the row/column corresponding to each material ID. Such compositions are close to regular languages and can be parsed via regular expressions (Figure 2). The first pattern parses simple number-compound expressions like 40Bi₂O₃ * 60B₂O₃. Here each of the two constituents will match with CST₁. The other two patterns handle nested compositions, where simple expressions are mixed in a given ratio. The main difference between the second and third patterns is in the placement of outer ratios: after or before the simple composition, respectively. Example match for PAT₂ is (40Bi
To materialize the rules of the rule-based composition parser, we pre-label compounds. For our dataset, we use a list-based extractor, though other chemical data extractors (Swain and Cole, 2016b) may also be used. After parsing, all coefficients are normalized so that they sum to one hundred. For nested expressions, the outer ratio and the inner ones are normalized separately and then multiplied.
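The following Python sketch illustrates the first pattern and the normalization step. It is a simplified stand-in, not the grammar of Figure 2: the compound list, the regular expression, and the function names are assumptions made for illustration, and nested expressions are passed in already split into (outer ratio, inner expression) pairs rather than parsed from text.

```python
import re

# Hypothetical compound list standing in for the list-based extractor.
COMPOUNDS = ["Bi2O3", "B2O3", "SiO2", "Na2O"]
# Longest names first so the alternation prefers the longest compound.
CPD = "|".join(sorted(map(re.escape, COMPOUNDS), key=len, reverse=True))
NUM = r"\d+(?:\.\d+)?"
CST = re.compile(rf"({NUM})\s*({CPD})")  # one number-compound constituent


def parse_simple(expr):
    """Parse number-compound pairs and normalize coefficients to sum to 100."""
    pairs = [(cpd, float(num)) for num, cpd in CST.findall(expr)]
    total = sum(value for _, value in pairs)
    return {cpd: 100 * value / total for cpd, value in pairs}


def combine_nested(parts):
    """Combine (outer_ratio, simple_expression) pairs.

    Outer ratios and inner coefficients are normalized separately
    and then multiplied, as described in the text.
    """
    outer_total = sum(ratio for ratio, _ in parts)
    result = {}
    for ratio, expr in parts:
        for cpd, pct in parse_simple(expr).items():
            result[cpd] = result.get(cpd, 0.0) + (ratio / outer_total) * pct
    return result


simple = parse_simple("40Bi2O3 * 60B2O3")
nested = combine_nested([(80, "40Bi2O3 * 60B2O3"), (20, "100SiO2")])
```

Because coefficients are re-normalized, an expression whose coefficients sum to more than 100 (for example, when a dopant is reported in excess) is mapped back to percentages that sum to 100.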
The compositions parsed by the rule-based composition parser are then matched with entries in the DB. A successful matching leads to a high-quality annotation of composition expressions in these papers. If this matching happens (i) in a single cell, the table is deemed as SCC; (ii) on caption/paper text that has an algebraic variable (or compound) found in the table, it is marked as MCC-PI (see Figure 1b). In case of no matching, the table is marked as NC. This automatic annotation is post-processed into row, column and edge labels.
One further challenge is that material IDs mentioned in papers are not provided in the DB. So, we manually annotate material IDs for all the identified composition tables in the training set. This leads to a train set of 11,207 materials with 38,799 tuples from 4,408 tables. Since the train set is distantly supervised and can be noisy, two authors of this paper (one of them a MatSci expert) manually annotated the dev and test tables with row/column/edge labels, units, tuples, compositions, and table type, resulting in over 2,500 materials and over 9,500 tuples per set. We used Cohen's Kappa measure for identifying inter-annotator agreement, which was 86.76% for glass ID, 98.47% for row and column labels, and 94.34% for table types. Conflicts were resolved through mutual discussions. Further statistics and the description of the in-house annotation tools developed for manual annotations are discussed in A.2.

DISCOMAT architecture
We find that the simplest task is to identify whether the table T is an SCC table, owing to the distinctive presence of multiple numbers and compounds in single cells. DISCOMAT first runs a GNN-based SCC predictor, which classifies T as an SCC table or not. For an SCC table, it uses the rule-based composition parser (described in Sec. 5). For the other category, DISCOMAT runs a second GNN (GNN₂), and labels rows and columns of T as compositions, material IDs, constituents, and others. If no constituents or composition predictions are found, then T is deemed to be a non-composition (NC) table. Else, it is an MCC table, for which DISCOMAT predicts whether it has all information in T or some information is missing (partial-information predictor). If it is a complete-information table, then GNN₂'s predictions are post-processed into compositions. If not, the caption and text of the paper are also processed, along with GNN₂'s predictions, leading to final composition extraction. We note that our system ignores statistically infrequent corner cases, such as single-cell partial-information tables; we discuss this further in our error analysis. We now describe each of these components, one by one.
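The control flow of the pipeline can be summarized in a short Python sketch. Every callable passed in is a hypothetical stand-in for a trained component or extractor; the function and parameter names are ours and do not come from the released code.

```python
def discomat_pipeline(table, is_scc, label_rows_cols, is_partial,
                      parse_scc, extract_ci, extract_pi):
    """Dispatch a table through the stages described in the text.

    Returns a (table_type, extracted_tuples) pair.
    """
    if is_scc(table):                        # GNN-based SCC predictor
        return "SCC", parse_scc(table)       # rule-based composition parser
    labels = label_rows_cols(table)          # second GNN: row/column labels
    if not any(label in ("composition", "constituent") for label in labels):
        return "NC", []                      # no composition content found
    if is_partial(table, labels):            # partial-information predictor
        return "MCC-PI", extract_pi(table, labels)  # also uses caption/text
    return "MCC-CI", extract_ci(table, labels)


# Toy usage with stub components.
kind, tuples = discomat_pipeline(
    table={},
    is_scc=lambda t: False,
    label_rows_cols=lambda t: ["id", "composition", "constituent"],
    is_partial=lambda t, labels: False,
    parse_scc=lambda t: [],
    extract_ci=lambda t, labels: [("A1", "MoO3", 5, "mol%")],
    extract_pi=lambda t, labels: [],
)
```

Passing the components as arguments keeps the sketch independent of any particular model implementation.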

GNN₁ and GNN₂ for table processing
At the core of DISCOMAT are two GNNs that learn representations for each cell, row, column and the whole table. Let table $T$ have $R$ rows and $C$ columns, and let the text at the $(i, j)$-th cell be $t_{ij}$, where $1 \le i \le R$ and $1 \le j \le C$. We construct a directed graph $G_T = (V_T, E_T)$, where $V_T$ has a node for each cell $(i, j)$, one additional node for each row and column, denoted by $(i, 0)$ and $(0, j)$, respectively, and one node for the whole table represented by $(0, 0)$. There are bidirectional edges between two nodes of the same row or column. All cell nodes have directed edges to the table node and also to their corresponding row and column nodes. The table, row, and column embeddings are randomly initialized with a common vector, which gets trained during learning. A node $(i, j)$'s embedding $\vec{x}_{ij}$ is initialized by running a language model LM over $t_{ij}$.
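The graph construction can be sketched as follows. This is an illustrative reading of the description, with one assumption stated explicitly: "two nodes of the same row or column" is taken to mean two cell nodes. The function name is ours.

```python
from itertools import permutations


def build_table_graph(R, C):
    """Build nodes and directed edges for a table with R rows and C columns.

    Cells are (i, j) with 1 <= i <= R and 1 <= j <= C; row nodes are (i, 0),
    column nodes are (0, j), and the table node is (0, 0).
    """
    cells = [(i, j) for i in range(1, R + 1) for j in range(1, C + 1)]
    nodes = (cells
             + [(i, 0) for i in range(1, R + 1)]
             + [(0, j) for j in range(1, C + 1)]
             + [(0, 0)])
    edges = set()
    # Ordered pairs give both directions, i.e. bidirectional edges.
    for a, b in permutations(cells, 2):
        if a[0] == b[0] or a[1] == b[1]:
            edges.add((a, b))
    for (i, j) in cells:
        edges.add(((i, j), (i, 0)))  # cell -> its row node
        edges.add(((i, j), (0, j)))  # cell -> its column node
        edges.add(((i, j), (0, 0)))  # cell -> table node
    return nodes, edges


nodes, edges = build_table_graph(2, 3)
```

For a 2 x 3 table this yields 12 nodes (6 cells, 2 rows, 3 columns, 1 table) and 36 directed edges (18 between cells, 18 from cells to row, column, and table nodes).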
As constructed, $G_T$ is permutation-invariant, i.e., if we permute rows or columns, we get the same graph and embeddings. However, initial rows/columns can be semantically different, since they often represent headings for the subsequent list. For instance, material IDs are generally mentioned in the first one or two rows/columns of the table. So, we additionally define index embeddings $\vec{p}_i$ to represent a row/column numbered $i$. We use the same index embeddings for rows and columns so that our model stays transpose-invariant. We also observe that while the first few indices are different, the semantics is generally uniform for indices higher than 3. Accordingly, to allow DISCOMAT to handle large tables, we simply use $\vec{p}_i = \vec{p}_3\ \forall i > 3$. Finally, any manually-defined features added to each node are embedded as $\vec{f}$ and concatenated to the cell embeddings. Combining all ideas, a cell embedding is initialized as:
$\vec{x}_{ij} = \vec{f}_{ij} \,\|\, \big(\mathrm{LM}_{\mathrm{CLS}}(\langle \mathrm{CLS}\ t_{ij}\ \mathrm{SEP} \rangle) + \vec{p}_i + \vec{p}_j\big)$
Here, $\|$ is the concat operation and $\mathrm{LM}_{\mathrm{CLS}}$ gives the contextual embedding of the CLS token after running an LM over the sentence inside $\langle\ \rangle$. Message passing is run on the graph $G_T$ using a GNN, which computes a learned feature vector $\vec{h}$ for every node:
$\{\vec{h}_{ij}\}_{i,j=(0,0)}^{(R,C)} = \mathrm{GNN}\big(\{\vec{x}_{ij}\}_{i,j=(0,0)}^{(R,C)}\big)$.
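The index-embedding rule can be made concrete with a small sketch that uses plain Python lists in place of tensors. How the pieces are combined is an assumption here: the sketch sums the LM vector with the row and column index embeddings and concatenates the feature vector in front. The function names and the toy values are ours.

```python
MAX_INDEX = 3  # indices above 3 share one embedding, as stated in the text


def index_id(i):
    """Map a row/column number to the index embedding that represents it."""
    return min(i, MAX_INDEX)


def init_cell(lm_vec, index_table, i, j, feat_vec):
    """Assumed combination: features || (LM + row index + column index)."""
    p_i = index_table[index_id(i)]
    p_j = index_table[index_id(j)]  # same table for rows and columns
    summed = [a + b + c for a, b, c in zip(lm_vec, p_i, p_j)]
    return feat_vec + summed  # list concatenation


# Toy 2-dimensional index embeddings shared by rows and columns.
index_table = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.5, 0.5]}
x = init_cell([0.0, 0.0], index_table, i=7, j=1, feat_vec=[9.0])
```

Because rows and columns share one embedding table and the two index embeddings enter symmetrically, swapping i and j leaves the result unchanged, which is the transpose-invariance mentioned above.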

SCC Predictor
In its pipeline, DISCOMAT first classifies whether $T$ is an SCC table. For that, it runs a GNN (named GNN₁) on $T$ with two manually defined features (see below). It then applies a multi-layer perceptron MLP₁ over the table-level feature vector $\vec{h}_{00}$ to make the prediction. Additionally, GNN₁ also feeds row and column vectors $\vec{h}_{i0}$ and $\vec{h}_{0j}$ through another MLP (MLP₂) to predict whether they contain material IDs or not. If $T$ is predicted as an SCC table, then the row/column with the highest MLP₂ probability is deemed the material ID row/column (provided the probability is greater than $\alpha$, where $\alpha$ is a hyper-parameter tuned on the dev set), and its contents are extracted as potential material IDs. If all row and column probabilities are less than $\alpha$, then the table is predicted to not have material IDs, as in Figure 1c.
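The selection rule for the material ID row/column can be sketched as below. The probabilities would come from the second MLP; here they are plain lists, and the function name and return convention are illustrative assumptions.

```python
def pick_material_id_axis(row_probs, col_probs, alpha):
    """Return ("row", i) or ("col", j) for the most probable material-ID
    row/column (1-based), or None if no probability exceeds alpha."""
    candidates = [("row", i, p) for i, p in enumerate(row_probs, start=1)]
    candidates += [("col", j, p) for j, p in enumerate(col_probs, start=1)]
    kind, index, best = max(candidates, key=lambda c: c[2])
    return (kind, index) if best > alpha else None


choice = pick_material_id_axis([0.1, 0.2], [0.9, 0.3], alpha=0.5)
no_ids = pick_material_id_axis([0.1, 0.2], [0.3, 0.4], alpha=0.5)
```

Returning None models tables such as Figure 1c, where no material IDs are present.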
For an SCC table, DISCOMAT must parse the full composition expression written in a single cell in the row/column corresponding to each material ID, for which it makes use of the rule-based composition parser (as described in Section 5). The only difference is that at test time there is no DB available, and hence extracted compositions cannot be matched further. Consequently, DISCOMAT retains all extracted composition expressions from the parser for further processing.
For units, DISCOMAT searches for common unit keywords such as mol, mass, weight, and their abbreviations like wt.% and at.%. The search is done iteratively with increasing distance from the cell containing the composition. If not found in the table, then the caption is searched. If still not found, mole% is used as default.
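A possible realization of this search is sketched below. The keyword patterns, the use of Manhattan distance between cells, and the returned unit strings are our assumptions; the text only specifies that cells are searched with increasing distance, then the caption, then a mole% default.

```python
import re

# Hypothetical keyword patterns, checked in this order.
UNIT_PATTERNS = [
    ("mol%", re.compile(r"mol", re.I)),
    ("wt%", re.compile(r"mass|weight|wt\.?\s*%", re.I)),
    ("at%", re.compile(r"at\.?\s*%", re.I)),
]


def detect_unit(text):
    """Return the first unit whose keyword occurs in the text, else None."""
    for unit, pattern in UNIT_PATTERNS:
        if pattern.search(text):
            return unit
    return None


def find_unit(table, cell, caption=""):
    """Search cells by increasing Manhattan distance from `cell` (0-based
    row, column), then the caption, then fall back to mol%."""
    r0, c0 = cell
    coords = [(abs(r - r0) + abs(c - c0), r, c)
              for r, row in enumerate(table) for c, _ in enumerate(row)]
    for _, r, c in sorted(coords):  # nearest cells first
        unit = detect_unit(table[r][c])
        if unit:
            return unit
    return detect_unit(caption) or "mol%"


table = [["Sample", "SiO2 (wt.%)"],
         ["S1", "70"]]
unit = find_unit(table, cell=(1, 1))
```

With the example table, the cell directly above the value supplies the unit, so the caption is never consulted.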
Manual Features: GNN₁ uses two hand-coded features. The first feature is set to true if that cell contains a composition that matches our rule-based composition parser. Each value, true or false, is embedded as $\vec{o}$. The second feature, named the max frequency feature, adds the bias that material IDs are generally unique in a table. We compute $q^r_i$ and $q^c_j$, which denote the maximum frequency of any non-empty string occurring in the cells of row $i$ and column $j$, respectively. If these numbers are on the lower side, then that row/column has more unique strings, which should increase the probability that it contains material IDs. The computed $q$ values are embedded in a vector as $\vec{q}$. The embedded feature vector $\vec{f}$ combines $\vec{o}$ and $\vec{q}$.
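The max frequency feature is simple to compute; a minimal sketch follows. It assumes a rectangular table given as a list of rows of strings, and the function names are ours.

```python
from collections import Counter


def max_frequency(strings):
    """Maximum count of any non-empty string; 0 if all cells are empty."""
    counts = Counter(s.strip() for s in strings if s.strip())
    return max(counts.values(), default=0)


def max_frequency_features(table):
    """Return (q_rows, q_cols) for a rectangular table of strings."""
    q_rows = [max_frequency(row) for row in table]
    q_cols = [max_frequency(col) for col in zip(*table)]
    return q_rows, q_cols


table = [["A1", "A2", "A3"],
         ["5", "5", "10"],
         ["38", "", "38"]]
q_rows, q_cols = max_frequency_features(table)
```

In this toy table the first row has value 1 (all strings unique), which is the pattern expected of a row holding material IDs.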

MCC-CI and MCC-PI Extractors
If $T$ is predicted to not be an SCC table, DISCOMAT runs it through another GNN (GNN₂). The graph structure is very similar to $G_T$ from Section 6.1, but with two major changes. First, a new caption node is created with initial embedding as given by the LM processing the caption text. Edges are added from the caption node to all row and column nodes. To propagate the information further to cells, edges are added from row/column nodes to corresponding cell nodes. The caption node especially helps in identifying non-composition (NC) tables. Second, the max frequency feature from Section 6.2 is also included in this GNN. We use the tables in Figure 4 as our running examples. While Figure 4a is a complete-information table, Figure 4b is not, and can only be understood in the context of its caption, which describes the composition as an expression in the variables x and y; these variables also need to be extracted and matched with the caption. DISCOMAT first decodes the row and column feature vectors $\vec{h}_{i0}$ and $\vec{h}_{0j}$, as computed by GNN₂, via an MLP₃ into four classes: composition, constituent, ID, and other (label IDs 1, 2, 3, 0, respectively). The figures illustrate this labelling for our running example. The cell at the intersection of a composition row/column and a constituent column/row represents the percentage contribution of that constituent in that composition.
Further, to associate the identified percentage contribution with the corresponding constituent (like P₂O₅ in Figure 4a) or variable (x and y in Figure 4b), we perform classification at the edge level. For ease of exposition, we describe our method in this Section 6.3 for the setting that the table has been predicted by GNN₂ to have row-wise orientation, i.e., rows are compositions and columns are constituents. A transposed computation is done in the reverse case. Since the constituent/variable will likely occur in the same column or row as the cell containing the percentage contribution, our method computes an edge feature vector: for edge $e = (i, j) \to (i', j')$, s.t. $i = i' \lor j = j'$, the feature vector $\vec{h}_e$ is built from the learned vectors of its two endpoint cells. It then takes all such edges $e$ from cell $(i, j)$, if row $i$ is labeled composition and column $j$ is labeled constituent. Each edge $e$ is classified through an MLP₄, and the edge with the maximum logit value is picked to identify the constituent/variable. This helps connect 36 to P₂O₅ and 0.8 to x in our running examples. GNN₂ also helps in predicting NC tables. In case none of the rows/columns are predicted as 1 or 2, the table is deemed as NC and discarded.
Partial information table predictor: Next, DISCOMAT distinguishes between complete-information (CI) and partial-information (PI) MCC tables. It uses a logistic regression model with custom input features for this prediction task. Let $P$ and $Q$ be the sets of all row indices with label 1 (composition) and column indices with label 2 (constituent), respectively. Also, assume $n_{ij}$ is the number present in table cell $(i, j)$, or 0 if no number is present. To create the features, we first extract all the constituents (compounds) and variables predicted by MLP₄. We now construct five table-level features (F1-F5). F1 and F2 count the number of unique variables and chemical compounds extracted by MLP₄. The intuition is that if F1 is high, then it is more likely an MCC-PI, and vice-versa if F2 is high. F3 computes the number of rows and columns labeled as 2 (constituent) by MLP₃. The higher the value of F3, the more likely it is that the table is MCC-CI. Features F4 (and F5) compute the maximum (average) of the sum of all extracted compositions. The intuition of F4 and F5 is that the higher these feature values, the higher the chance of the table being an MCC-CI. Formally, $F4 = \max_{i \in P} \sum_{j \in Q} n_{ij}$ and $F5 = \frac{1}{|P|} \sum_{i \in P} \sum_{j \in Q} n_{ij}$.
MCC table extractor: For MCC-CI, MLP₃ and MLP₄ outputs are post-processed, and units are added (similar to SCC tables), to construct final extracted tuples. For MCC-PI, on the other hand, information in the text needs to be combined with the MLP outputs for final extraction. The first step here is to search for the composition expression, which may be present in the table caption, table footer, and if not there, somewhere in the rest of the research paper. Here, DISCOMAT resorts to using our rule-based composition parser from Figure 2, but with one key difference. Now, the composition may contain variables (x, y) and even mathematical expressions like 100 − x. So the regular grammar is enhanced to replace the non-terminal NUM with a non-terminal EXPR, which represents numbers, variables, and simple mathematical expressions over them. An added constraint is that if there are variables in set $Q$, then those variables must be present in the matched composition expression. DISCOMAT completes the composition by substituting the variable values from every composition row into the matched composition. There may be other types of MCC-PI tables where only compounds are identified in tables, such as Figure 1b. For these, DISCOMAT first computes the constituent contributions in terms of variables from the composition expression, and then equates them with the numbers present in rows/columns labeled 1 (composition). In our example, DISCOMAT matches x with the numbers 10, 20, 30, and 40, and the rest of the composition is extracted by processing the composition expression in the caption with these values of x. Units and material IDs are added to the tuples, similar to other tables.
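The substitution step for MCC-PI tables can be sketched as follows. The composition template below (x Na2O, 100 − x SiO2) is a made-up example, not the expression from any figure in the paper; only the values 10, 20, 30, and 40 for x come from the text. The expression evaluator is a minimal illustration that accepts numbers, variables, and the four basic arithmetic operators.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def eval_expr(expr, env):
    """Evaluate an EXPR-like string with numbers, variables, and + - * /."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr}")
    return walk(ast.parse(expr, mode="eval").body)


def complete_compositions(template, rows):
    """Substitute each row's variable values into the composition template.

    template: list of (coefficient expression, compound) pairs.
    rows: list of dicts mapping variable names to values from the table.
    """
    return [{cpd: eval_expr(coef, env) for coef, cpd in template}
            for env in rows]


# Hypothetical caption expression: x Na2O, (100 - x) SiO2.
template = [("x", "Na2O"), ("100 - x", "SiO2")]
rows = [{"x": v} for v in (10, 20, 30, 40)]
compositions = complete_compositions(template, rows)
```

Walking the syntax tree instead of calling `eval` restricts the input to the small expression language that the enhanced grammar is described as accepting.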

Constraint-aware loss functions
DISCOMAT needs to train the two GNNs and the PI table predictor. Our data construction provides gold labels for each prediction task (discussed in the next section), so we train them component-wise. The PI table predictor is trained on the standard logistic regression loss. GNN₁ is trained on a weighted sum of binary cross-entropy loss for SCC table classification and row/column classification for material IDs (the weight is a hyper-parameter). Similarly, the GNN₂ loss function consists of the sum of row/column cross-entropy and edge binary cross-entropy losses.
GNN₂ has a more complex prediction problem since it has to perform four-way labeling for each row and column. In initial experiments, we find that the model sometimes makes structural errors like labeling one row as a constituent and another row as a composition in the same table, which is highly unlikely as per the semantics of composition tables. To encourage GNN₂ to make structurally consistent predictions, we express a set of constraints on the complete labelings, as follows. (1) A row and a column cannot both have compositions or constituents. (2) Composition and material ID must be orthogonally predicted (i.e., if a row has a composition then the ID must be predicted in some column, and vice versa). (3) Constituents and material IDs must never be orthogonally predicted (if rows have constituents then another row must have the ID). And, (4) material ID must occur at most once for the entire table. As an example, constraint (1) can be expressed as a hard constraint as:
$r_i = l \Rightarrow c_j \neq l, \quad \forall i \in \{1, \ldots, R\},\ j \in \{1, \ldots, C\},\ l \in \{1, 2\}$.
Here, $r_i$ and $c_j$ are the predicted labels of row $i$ and column $j$. We wish to impose these structural constraints at training time so that the model is trained to honor them. We follow prior work by Nandwani et al. (2019) to first convert these hard constraints into a probabilistic statement. For example, constraint (1) gets expressed as:
$P(r_i = l; \theta) + P(c_j = l; \theta) - 1 \le 0, \quad \forall i \in \{1, \ldots, R\},\ j \in \{1, \ldots, C\},\ l \in \{1, 2\}$,
where $\theta$ represents GNN₂'s parameters. Following the same work, each such constraint gets converted to an auxiliary penalty term, which gets added to the loss function for constraint-aware training. The first constraint gets converted to:
$\lambda \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{l=1}^{2} \max(0, P(r_i = l; \theta) + P(c_j = l; \theta) - 1)$.
This and similar auxiliary losses for other constraints (App. A.1) get added to GNN₂'s loss function for better training. $\lambda$ is a hyper-parameter. We also use constraint (4) for GNN₁ training.
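The penalty for constraint (1) can be computed directly from the predicted label distributions. The sketch below uses plain Python lists for readability; in training, the same expression would be evaluated on differentiable tensors. The function name and the toy probabilities are ours.

```python
def constraint1_penalty(row_probs, col_probs, lam=1.0):
    """Auxiliary penalty for constraint (1).

    row_probs[i][l] and col_probs[j][l] are the predicted probabilities
    of label l (0 = other, 1 = composition, 2 = constituent, 3 = ID).
    """
    total = 0.0
    for p_row in row_probs:
        for p_col in col_probs:
            for label in (1, 2):
                # Positive only when a row and a column are both
                # confidently assigned the same label.
                total += max(0.0, p_row[label] + p_col[label] - 1.0)
    return lam * total


# A row and a column that are both certainly "composition": violation.
violating = constraint1_penalty([[0.0, 1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]])
# Row is "composition", column is "constituent": consistent labeling.
consistent = constraint1_penalty([[0.0, 1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0, 0.0]])
```

The hinge form means that predictions satisfying the probabilistic statement contribute exactly zero, so the penalty only acts on structurally inconsistent labelings.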
Baseline models: We implement DISCOMAT with LM as MATSCIBERT (Gupta et al., 2022), and the GNNs as Graph Attention Networks (Veličković et al., 2018). We compare DISCOMAT with six non-GNN baseline models. Our first baseline is TAPAS (Herzig et al., 2020), a state-of-the-art table QA system, which flattens the table, adds row and column index embeddings, and passes it as input to a language model. To use TAPAS for our task, we use the table caption as a proxy for the input question. All the model parameters in this setting are initialized randomly. Next, we use TABERT (Yin et al., 2020), which is a pretrained LM that jointly learns representations for natural language (NL) sentences and tables by using pretraining objectives of masked column prediction (MCP) and cell value recovery (CVR). It finds table cell embeddings by passing row linearizations concatenated with the NL sentence into a language model and then applying vertical attention across columns for information propagation. Finally, we use TABBIE, which is pretrained by corrupt cell detection and learns exclusively from tabular data without any associated text, unlike the previous baselines. Additionally, we replace the LM of all models with MATSCIBERT to provide domain-specific embeddings and obtain the respective ADAPTED versions. We also implement a simple rule-based baseline for MCC-CI and NC tables. The baseline identifies constituent names using regex matching and a pre-defined list of compounds, extracts numbers from cells, and finds the units using simple heuristics to generate the required tuples. Further details on baselines are provided in App. A.3.
Evaluation metrics: We compute several metrics in our evaluation. (1) ... (3) Tuple-level (TL) F1 score evaluates performance on the extraction of composition tuples. A gold tuple is considered to match a predicted 4-tuple if all arguments match exactly. (4) Material-level (MatL) F1 score is the strongest metric. It evaluates whether all predicted information related to a material (including its ID, all constituents and their percentages) matches exactly with the gold. Finally, (5) constraint violations (CV) counts the number of violations of hard constraints in the prediction. We consider all four types of constraints, as discussed in Section 6.4. Implementation details are mentioned in App. A.4.
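A tuple-level F1 score with exact matching can be sketched as below. This is a generic set-based computation under our own assumptions; in particular, how scores are aggregated across tables is not specified here, and the function name is ours.

```python
def tuple_f1(gold, pred):
    """F1 over exact matches between gold and predicted 4-tuples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [("A1", "MoO3", 5, "mol%"),
        ("A1", "Fe2O3", 38, "mol%"),
        ("A1", "P2O5", 57, "mol%")]
pred = [("A1", "MoO3", 5, "mol%"),
        ("A1", "Fe2O3", 38, "mol%"),
        ("A1", "P2O5", 75, "mol%")]  # wrong percentage, so no match
score = tuple_f1(gold, pred)
```

In the example, two of three predicted tuples match exactly, so precision and recall are both 2/3. A material-level score is stricter still: the single wrong percentage above would make the whole material A1 count as incorrect.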

Results
How does table linearization compare with a graph-based model for our task? To answer this question, we compare DISCOMAT with four models that use linearization: TAPAS, TABERT, and their adapted versions. TAPAS and TABERT do table-level and row-level linearizations, respectively. Since the baselines do not have the benefit of regular expressions, features, and constraints, we implement a version of our model without these, which we call V-DISCOMAT. We do this comparison, trained and tested only on the subset of MCC-CI and NC tables, since other table types require regular expressions for processing. As shown in Table 1, V-DISCOMAT obtains 6-7 pt higher F1 on TL and MatL scores. Moreover, compared to the RULE BASED SYSTEM, DISCOMAT obtains up to 17 points improvement in the MatL F1 score. This experiment suggests that a graph-based extractor is a better fit for our problem; this led to us choosing a GNN-based approach for DISCOMAT.
How does DISCOMAT perform on the complete task? Table 2 reports DISCOMAT performance on the full test set with all table types. Its ID and tuple F1 scores are 82 and 70, respectively. Since these errors get multiplied, its material-level F1 score is, unsurprisingly, lower (63.5). Table 3 reports DISCOMAT performance for different table types.
What is the incremental contribution of task-specific features and constraints? Table 2 also presents the ablation experiments. DISCOMAT scores much higher than V-DISCOMAT, which does not have these features and constraints. We also perform additional ablations, removing one component at a time. Unsurprisingly, constrained training helps reduce constraint violations. Both constraints and features help with ID prediction, owing to constraints (2), (3), and (4) and the max frequency feature. Removal of caption nodes significantly hurts performance on MCC-PI tables, as these tables require combining the caption with table cells. Although removing features, constraints, and captions individually does not change the tuple-level and material-level scores much, removing all three (as in V-DISCOMAT) drops performance significantly. We therefore conclude that although each component improves DISCOMAT only marginally on its own, together they yield significant gains.
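What are the typical errors in DISCOMAT? The confusion matrix in Figure 5 suggests that most table-type errors are between MCC-PI and NC tables. This could be attributed to the following reasons. (i) DISCOMAT has difficulty identifying rare compounds such as Yb2O3, ErS3/2, and Co3O4 found in MCC-PI tables, as these are not frequent in the training set. (ii) MCC-PI tables specify dopant percentages found in small quantities. (iii) Completion of the composition in MCC-PI tables may require other tables from the same paper. (iv) Finally, an MCC-PI composition may contain additional information, such as properties, that may bias the model to classify it as NC. Some corner cases are given in App. A.6.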

Conclusions
We define the novel and challenging task of extracting material compositions from tables in scientific papers. This task has importance beyond materials science, since many other scientific disciplines use tables to express compositions in their domains. We harvest a dataset using distant supervision, combining information from a MatSci DB with tables in the respective papers. We present a strong baseline system, DISCOMAT, for this task. It encodes tables as graphs and trains GNNs for table-type classification. Further, to handle incomplete information in PI tables, it includes the text associated with the tables from the respective papers. To handle domain-specific regular languages, a rule-based composition parser helps the model by extracting chemical compounds, numbers, units, and composition expressions. We find that our DISCOMAT baseline outperforms other architectures that linearize the tables by large margins. In the future, our work can be extended to extract material properties, which are also often found in tables. The code and data are made available in the GitHub repository of this work.

N. M. Anoop Krishnan acknowledges the funding support received from SERB (ECR/2018/002228), DST (DST/INSPIRE/04/2016/002774), BRNS YSRA (53/20/01/2021-BRNS), and ISRO RESPOND as part of the STC at IIT Delhi. Mohd Zaki acknowledges the funding received from the PMRF award by the Government of India. Mausam acknowledges grants by Google, IBM, Verisk, and a Jai Gupta chair fellowship. He also acknowledges travel support from Google and Yardi School of AI travel grants. The authors thank the High Performance Computing (HPC) facility at IIT Delhi for computational and storage resources.

Limitations and outlook
DISCOMAT is a pipelined solution trained component-wise. This raises a research question: can we train a single end-to-end ML model that not only analyzes a wide variety of table structures but also combines the understanding of regular expressions, the extraction of chemical compounds and scientific units, textual understanding, and some mathematical processing? This is a challenging ML research question and one that can have a direct impact on the MatSci community. Automating parts of scientific discovery through such NLP-based approaches does carry the potential for biases and errors. Note that wrong and biased results can lead to erroneous information about materials. To a great extent, this issue is addressed as we rely only on published literature. The issue could be further addressed by considering larger datasets covering a wider range of materials.

A.1 Constraint-aware training
As discussed in Section 6.4, to encourage GNN2 to make structurally consistent predictions, we express a set of constraints on the complete labeling as follows. (1) A row and a column cannot both have compositions or constituents. (2) Composition and material ID must be orthogonally predicted (i.e., if a row has a composition, then the ID must be predicted in some column, and vice versa). (3) Constituents and material IDs must never be orthogonally predicted (i.e., if rows have constituents, then another row in the table must have the ID). (4) Material ID must occur at most once in the entire table. Let r_i and c_j be the predicted labels of row i and column j. Further, let θ represent the parameters of GNN2.
As explained in Section 6.4, we convert all these probabilistic statements to an auxiliary penalty term, which gets added to the loss function.
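The four hard constraints can also be checked on a discrete labeling, which is what the constraint violation (CV) metric needs. The following is a minimal sketch, not the probabilistic penalty term itself. The label names ("composition", "constituent", "id", "other") and the function name are hypothetical, and constraint (2) is read here as forbidding an ID on the same axis as a composition, without requiring that an ID be present.

```python
def constraint_violations(row_labels, col_labels):
    """Return the numbers (1-4) of the hard constraints violated by one table."""
    violated = []
    content = ("composition", "constituent")

    # (1) Compositions/constituents may appear on rows or on columns, not both.
    rows_have_content = any(label in content for label in row_labels)
    cols_have_content = any(label in content for label in col_labels)
    if rows_have_content and cols_have_content:
        violated.append(1)

    # (2) A composition and the material ID must lie on different axes.
    if ("composition" in row_labels and "id" in row_labels) or \
       ("composition" in col_labels and "id" in col_labels):
        violated.append(2)

    # (3) Constituents and the material ID must lie on the same axis.
    if ("constituent" in row_labels and "id" in col_labels) or \
       ("constituent" in col_labels and "id" in row_labels):
        violated.append(3)

    # (4) The material ID may be predicted at most once per table.
    if row_labels.count("id") + col_labels.count("id") > 1:
        violated.append(4)

    return violated

# Rows hold compositions and one column holds the ID: consistent labeling.
ok = constraint_violations(["other", "composition", "composition"], ["id", "other"])
# ID on the same axis as a composition, plus constituents on the other axis.
bad = constraint_violations(["composition", "id"], ["constituent"])
```

Summing the lengths of the returned lists over all test tables would give one possible CV count under these assumptions.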

A.2 Dataset details
We manually annotated the val and test sets because distantly supervised annotations can be noisy and are not always accurate. The inter-annotator agreement has already been discussed in Section 5. In addition to supporting manual annotation, the in-house annotation tool contained several checks for conditions that should not arise, such as whether the annotator missed annotating any table, used out-of-range labels, marked a row or column as having both a composition and a constituent, or marked compositions or constituents in both the rows and the columns of a table. With the help of these self-checks and mutual discussions on disagreements, we annotated our val and test datasets. Table 4 presents some statistics about our dataset. Table 4a shows the number of tables in our dataset belonging to different table types. Further, Table 4b shows the total number of publications, materials, and tuples in all three splits. We release our code and data under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) Public License.

A.3 Baseline models
In this section, we describe the details of our baseline models: TAPAS, TAPAS-ADAPTED, TABERT, and TABERT-ADAPTED. Since the TAPAS (Herzig et al., 2020) architecture has been used for QA over tables and we do not have any questions in the composition extraction task, we use the table caption as a proxy for the question. We replace empty table cells with a special [EMPTY] token. The table caption and the text in table cells are converted to word-pieces using the LM tokenizer. Then, we concatenate the word-pieces of the caption and the row-wise flattened table. Note that it is possible to obtain more than one word-piece for some table cells. Since the input length after tokenization can be greater than 512, we truncate the minimum possible number of rows from the end so that the length becomes less than or equal to 512. To avoid a large number of rows getting truncated due to long captions, we truncate the caption so that it contributes at most 100 word-pieces. To differentiate between table cells belonging to different rows and/or columns, row and column index embeddings are added to the word-piece embeddings in the TAPAS architecture. Position and segment embeddings are the same as in BERT (Devlin et al., 2019), except that position indexes are incremented when the table cell changes. The original TAPAS architecture also adds different rank embeddings to the input in order to answer rank-based questions. We use the same rank embedding for every table cell, since there is no rank relation among the table cells in our case. All these different types of embeddings are added together and passed through the LM. We take the contextual embedding of the first word-piece of every table cell as its representation. Since we do not have row and column nodes here, row and column embeddings are computed by taking the …
Figure 7 shows the schematic of the TAPAS-ADAPTED model. Here, we initialize the LM weights with those of MATSCIBERT (Gupta et al., 2022). All other details are the same as in the TAPAS model, except that here we add row and column index embeddings to the MATSCIBERT output instead of the input.
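The caption and row truncation used for the TAPAS input can be sketched as follows. This is a minimal illustration under stated assumptions: a whitespace split stands in for the LM word-piece tokenizer, special tokens such as [CLS] and [SEP] are ignored, and the function names are hypothetical. Each token carries its row and column index (None for caption tokens) so that index embeddings could be looked up later.

```python
EMPTY = "[EMPTY]"

def tokenize(text):
    """Stand-in for the LM word-piece tokenizer (assumption): whitespace split."""
    return text.split()

def linearize(caption, table, max_len=512, max_caption=100):
    """Flatten caption + table row-wise, dropping as few trailing rows as needed."""
    # The caption contributes at most `max_caption` tokens.
    caption_tokens = [(tok, None, None) for tok in tokenize(caption)[:max_caption]]

    rows = []
    for r, row in enumerate(table):
        row_tokens = []
        for c, cell in enumerate(row):
            # Empty cells are replaced by the special [EMPTY] token.
            pieces = tokenize(cell) or [EMPTY]
            row_tokens.extend((piece, r, c) for piece in pieces)
        rows.append(row_tokens)

    # Remove whole rows from the end until the sequence fits.
    total = len(caption_tokens) + sum(len(row) for row in rows)
    while rows and total > max_len:
        total -= len(rows.pop())

    return caption_tokens + [tok for row in rows for tok in row]

table = [["Glass", "SiO2", ""], ["G1", "70", "30"]]
full = linearize("Composition of glasses", table)
short = linearize("Composition of glasses", table, max_len=6)
```

With the small `max_len` in the last call, the second row is dropped as a whole, which mirrors the row-level truncation described above.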
For TABERT, we also use the table caption as the proxy for the NL sentence, concatenate it with the linearized rows, and feed the result into the TABERT model, which generates cell embeddings by passing the input through BERT and applying vertical attention to propagate information across columns. Following the linearization used by TABERT, we linearize each cell as a concatenation of its cell type and cell value, where the cell type is one of numeric, alphanumeric, or text. Since DISCOMAT does not use pretraining, we do not use TABERT's pretrained weights but instead train from initial weights on our row, column, and edge-level prediction tasks. We also implement another baseline called TABERT-ADAPTED, which replaces the BERT encoder in TABERT with MATSCIBERT (Gupta et al., 2022) to provide materials science domain information to the model.
In TABBIE, as opposed to TAPAS and TABERT, table cells are passed independently into the LM instead of being linearized into a single long sequence. As with TABERT, we do not initialize TABBIE's architecture with its pretrained weights, for a fair comparison. TABBIE-ADAPTED again replaces the BERT encoder, this time in TABBIE, with MATSCIBERT (Gupta et al., 2022).
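The cell-level linearization used for TABERT, in which each cell becomes its type followed by its value, can be sketched as below. The typing rules, the separator string, and the function names are assumptions made for illustration; the text only states that the cell type is numeric, alphanumeric, or text.

```python
import re

# Signed integers or decimals such as "70", "-3", "70.5", ".5" (assumed rule).
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def cell_type(value):
    """Classify a cell as numeric, alphanumeric, or text."""
    text = value.strip()
    if NUMERIC.fullmatch(text):
        return "numeric"
    if any(ch.isdigit() for ch in text):
        return "alphanumeric"
    return "text"

def linearize_row(row, sep=" | "):
    """Concatenate 'type value' for every cell of one row."""
    return sep.join(f"{cell_type(cell)} {cell}" for cell in row)

example = linearize_row(["G1", "70", "x"])
```

The row string would then be concatenated with the caption before being passed to the encoder, as described above.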

A.4 Implementation details
For Graph Attention Networks (GATs) (Veličković et al., 2018), we use the GAT implementation of the Deep Graph Library (Wang et al., 2019). For the LMs and TAPAS, we use the implementations in the Transformers library (Wolf et al., 2020). We use TABERT's source code from its GitHub repository. We implement and train all models using PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2017). We optimize the model parameters using Adam (Kingma and Ba, 2015) and a triangular learning rate (Smith, 2017). We further use different learning rates for LM and non-LM parameters (GNNs, MLPs) (App. A.5). To deal with imbalanced labels, we scale the loss for each label by a weight inversely proportional to its frequency in the training set. All experiments were run on a machine with one 32 GB V100 GPU. Each model is run with three seeds, and the mean and standard deviation are reported.
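Inverse-frequency loss weighting can be sketched in a few lines. The normalization used here (total count divided by the number of classes times the class count) is an assumption; the text only states that the weights are inversely proportional to label frequency. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each label to a weight proportional to 1 / its training frequency."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}

# Toy training labels: the rare class receives the larger weight.
train_labels = ["other"] * 8 + ["composition"] * 2
weights = inverse_frequency_weights(train_labels)
```

Such a dictionary could be converted into a per-class weight vector for a weighted cross-entropy loss.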

A.5 Hyper-parameter details
We now describe the hyper-parameters of DISCOMAT. Both GNN1 and GNN2 can have multiple hidden layers with different numbers of attention heads. We experiment with hidden layer sizes of 256, 128, and 64 and with 6, 4, and 2 attention heads. We include residual connections in the GAT, exponential linear unit (ELU) non-linearity after hidden layers, and LeakyReLU non-linearity (with slope α = 0.2) to compute attention weights, as done in Veličković et al. (2018). Training is performed using 8 tables in a batch, and we select the checkpoint with the maximum dev MatL F1 score.
We use a triangular learning rate and choose the peak learning rate for the LM from 1e-5, 2e-5, and 3e-5 and the peak learning rate for non-LM parameters from 3e-4 and 1e-3. A warmup ratio of 0.1 is used for all parameters. We further use batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) with a probability of 0.2 in all MLPs. We use the same λ for every constraint penalty term. Embedding sizes for features are chosen from 128 and 256, and the edge loss weight is selected from 0.3 and 1.0.
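A triangular schedule with a warmup ratio of 0.1 can be sketched as a function of the training step. The sketch assumes linear warmup from zero to the peak followed by linear decay back to zero; the exact shape used in the experiments is not specified beyond "triangular", and the function name and step counts are hypothetical.

```python
def triangular_lr(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Separate peaks for LM and non-LM parameters, as in the text.
lm_lr = triangular_lr(step=550, total_steps=1000, peak_lr=2e-5)
gnn_lr = triangular_lr(step=550, total_steps=1000, peak_lr=1e-3)
```

Both parameter groups follow the same shape and differ only in their peak value, so one schedule function can serve both optimizer groups.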

A.6 Corner cases
Figure 8 shows examples of some corner case tables. In Figure 8a, elements are used as variables. Moreover, the values that the variables can take are present in a single cell only. Figure 8b shows a table where units occur within the composition itself. Also, mixed units are used to express the composition. Figure 8c comprises compositions that contain both elements and compounds. However, we wrote separate REs for element compositions and for compound compositions, so our REs are unable to match these.
Figure 9 shows some more examples of corner cases. In Figure 9a, the first compound has to be inferred using the material IDs. For example, W corresponds to WO3 and Nb corresponds to Nb2O5. DISCOMAT makes the assumption that a composition is present in a single row/column. Figure 9b refutes this assumption, as compositions are present in multiple rows. Sometimes researchers report both theoretical (nominal) and experimental (analyzed) compositions for the same material. The table in Figure 9c lists both types of compositions in the same cell and hence cannot be extracted using DISCOMAT.

Figure 2: Regexes in parser
Figure 2 shows the regular expressions (simplified, for understandability) used by the parser. Here CMP denotes the matched composition, PATs are the three main patterns for it, CSTs are sub-patterns, CPD is a compound, NUM is a number, and OB and CB are, respectively, open and closed parentheses (or square brackets). W is zero or more whitespace characters, and SEP contains explicit separators like '-' or '+'. START and END are indicators to separate a regular expression from the rest of the text. The first pattern parses simple number-compound expressions like 40Bi2O3 * 60B2O3. Here each of the two constituents will match with CST1. The other two patterns handle nested compositions, where simple expressions are mixed in a given ratio. The main difference between the second and third patterns is in the placement of the outer ratios: after or before the simple composition, respectively. An example match for PAT2 is (40Bi2O3+60B2O3)30-(AgI+AgCl)70, and for PAT3 is 40Bi2O3,40B2O3,20(AgI:2AgCl). To materialize the rules of the rule-based composition parser, we pre-label compounds. For our dataset, we use a list-based extractor, though other chemical data extractors (Swain and Cole, 2016b) may also be used. After parsing, all coefficients are normalized so that they sum to 100. For nested expressions, the outer ratio and the inner ones are normalized separately and then multiplied. The compositions parsed by rule-based composi-
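The first pattern and the normalization step can be illustrated with a minimal sketch. It is not the parser from Figure 2: the compound list, the separator set, and all names are hypothetical, a missing coefficient is assumed to default to 1, and nested expressions are handled only after their groups have been split by hand into (inner expression, outer ratio) pairs.

```python
import re

# Hypothetical pre-labelled compound list; longer names are tried first.
COMPOUNDS = ["Bi2O3", "B2O3", "AgI", "AgCl"]
CPD = "|".join(sorted(map(re.escape, COMPOUNDS), key=len, reverse=True))
NUM = r"\d+(?:\.\d+)?"
CST = re.compile(rf"\s*({NUM})?\s*({CPD})\s*")  # optional number, then compound
SEP = re.compile(r"[-+,:*]")                    # assumed separator set

def normalize(pairs, total=100.0):
    """Rescale coefficients so that they sum to `total`."""
    s = sum(value for _, value in pairs)
    return [(name, total * value / s) for name, value in pairs]

def parse_simple(expr):
    """Parse 'NUM CPD (SEP NUM CPD)*'; return None if the whole string does not match."""
    pairs, pos = [], 0
    while True:
        m = CST.match(expr, pos)
        if not m:
            return None
        pairs.append((m.group(2), float(m.group(1) or 1)))
        pos = m.end()
        if pos == len(expr):
            return normalize(pairs)
        if not SEP.match(expr, pos):
            return None
        pos += 1

def combine(groups):
    """Normalize outer ratios and inner coefficients separately, then multiply."""
    outer = normalize(list(groups), total=1.0)
    result = {}
    for expr, weight in outer:
        for name, pct in parse_simple(expr):
            result[name] = result.get(name, 0.0) + weight * pct
    return result

simple = parse_simple("40Bi2O3 * 60B2O3")
nested = combine([("40Bi2O3+60B2O3", 30), ("AgI+AgCl", 70)])
```

For the nested example, the outer ratios 30 and 70 become weights 0.3 and 0.7, and each inner composition is normalized to 100 before the two are multiplied, so the final percentages again sum to 100.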

Figure 3: The design of DISCOMAT

Figure 4: Multi-cell composition tables. (a) Complete information (Koudelka et al., 2014). (b) Partial information (Epping et al., 2005).
(i) DISCOMAT has difficulty identifying rare compounds such as Yb2O3, ErS3/2, and Co3O4 found in MCC-PI tables, as these are not frequent in the training set. (ii) MCC-PI tables specify dopant percentages found in small quantities. (iii) Completion of the composition in MCC-PI tables may require other tables from the same paper. (iv) Finally, an MCC-PI composition may contain additional information, such as properties, that may bias the model to classify it as NC. Some corner cases are given in App. A.6.

Figure 5: Confusion matrix for all table types

(1) Table-type (TT) prediction accuracy computes table-level accuracy on the 4-way table classification as NC, SCC, MCC-CI, and MCC-PI. (2) ID F1 score computes the F1 score for material ID extraction.
Table 3 reports DISCOMAT performance for different table types. In this experiment, we assume that the table type is already known and run only the relevant part of DISCOMAT for extraction. We find that MCC-PI is the hardest table type, since it requires combining information from text and tables for accurate extraction. The larger standard deviation in ID F1 for MCC-PI is attributed to the fact that material IDs occur relatively rarely for this table type: the test set for MCC-PI consists of merely 20 material ID rows and columns.

Table 1: Performance of V-DISCOMAT vs. baseline models on the subset of data containing only MCC-CI and NC table types.

Table 2: Contribution of task-specific features and constraints in DISCOMAT on the complete dataset.

Table 3: DISCOMAT performance on the different table types.

Table 4: Number of (a) tables of each table type and (b) journals from which the tables are obtained, materials in the tables, and tuples, for the three splits. Columns of Table 4a: Table Type, Train, Dev, Test.