CoVA: Context-aware Visual Attention for Webpage Information Extraction

Webpage information extraction (WIE) is an important step to create knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges as context and appearance are encoded in an abstract manner. To address this challenge we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach we collect a new large-scale dataset of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and others. On this dataset we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.


Introduction
Webpage information extraction (WIE) is an important step when creating a large-scale knowledge base [1,8] which has many downstream applications such as knowledge-aware question answering [27] and recommendation systems [28,35].
Classical methods for WIE, like Wrapper Induction [7,39,46], rely on the publicly available source code of websites. The code is commonly parsed into a document object model (DOM) tree. The DOM tree is a programming-language-independent tree representation of any website, which contains all its elements. It can be obtained using various libraries like Puppeteer. These elements contain information about their location in the rendered webpage, styling like font size, etc., and text if it is a leaf node. Various developer tools have been developed to inspect the DOM tree and modify it manually.
However, using only the DOM tree for WIE is increasingly challenging for a variety of reasons: 1) Webpages are programmed to be aesthetically pleasing; 2) Oftentimes content and style are separated in website code and hence in the DOM tree; 3) The same visual result can be obtained in a plethora of ways; 4) Branding banners and advertisements are interspersed with information of interest. For instance, WHISK [46], a WIE method which learns extraction rules for the number of bedrooms in a hotel, fails if the style, content, or tag of the web element containing "bedroom" is changed.
For this reason, more recent WIE methods apply optical character recognition (OCR) on rendered websites followed by word embedding-based natural language extraction [47]. However, as mentioned before, recent webpages are highly enriched with visual content, and classical word embeddings don't capture this contextual information [53]. For instance, text in advertising banners may be interpreted as valuable information. For this reason, simple OCR detection followed by natural language processing techniques is a suboptimal WIE approach [53].
In response to these challenges we develop WIE based on a visual representation of a web element and its context. This allows us to address the aforementioned four challenges. Moreover, visual features are independent of the programming language (e.g., HTML for webpages, Dart for Android or iOS apps) and partially also of the website language (e.g., Arabic, Chinese, English). Intuitively, we aim to mimic the ability of humans to detect the location of target elements like product price, product title and product image on a webpage in a foreign language like the one shown in Fig. 1.
For this, we develop a context-aware Webpage Object Detection (WOD), which we refer to as Context-aware Visual Attention-based detection (CoVA), where entities like prices are objects. Somewhat differently from an object in natural images which can be detected largely based on its appearance, objects on a webpage are strongly defined by contextual information. E.g., a cat's appearance is largely independent of its nearby objects, whereas a product price is a highly ambiguous object (Fig. 2). It refers to the price of a product only when it is contextually related to a product title and a product image. The developed WOD uses a graph attention [29] based architecture, which leverages the underlying syntactic DOM tree [58] to focus on important context [59] while classifying an element on a webpage.
To facilitate this task we create a dataset of 7.7k product webpage screenshots along with DOM information spanning 408 different websites (domains). We create a cross-domain split so as to train on some domains (e.g., Amazon, Etsy), and evaluate on others (e.g., eBay). We compare the results of CoVA with existing and newly created baselines that take visual features into account. For this we use accuracy of each class as the evaluation metric and show that CoVA leads to substantial improvements while yielding interpretable contextual representations (see Sec. 5). In summary, we make the following contributions:
1. We formulate WIE as a context-aware WOD problem.
2. We develop a Context-aware Visual Attention-based (CoVA) detection pipeline, which is end-to-end trainable and exploits syntactic structure from the DOM tree along with screenshot images. CoVA uses a variant of Fast R-CNN [13] to obtain a visual representation and graph attention [52] for contextual learning on a graph constructed from the DOM tree. CoVA improves upon recent state-of-the-art baselines by a significant margin.
3. We create the largest public dataset of 7.7k product webpage screenshots from 408 online retailers for Object Detection from product webpages. Our dataset is ∼10× larger than existing datasets.
4. We show the interpretability of CoVA using attention visualizations (Sec. 6.5).

Related Work
Webpage information extraction (WIE) has been mainly addressed with Wrapper Induction (WI). WI aims to learn a set of extraction rules from HTML code or text, using manually labeled examples and counter-examples [7,39,46]. These often require human intervention which is time-consuming, error-prone [50], and does not generalize to new templates.
Supervised learning, which treats WIE as a classification task, has also garnered significant attention. Structural and semantic features [12,22,55] are obtained for each part of a webpage to predict categories like title, author, etc. Wu et al. [54] cast WIE as an HTML node selection problem using features such as positions, areas, fonts, text, tags, and links. Joshi and Liu [24] develop a semantic similarity between blocks of webpages using textual and DOM features to extract the key article on a webpage. Rastogi et al. [41] extract visual cues which are followed by learning the relationship between elements. This information is utilized to allocate a document into predefined templates for which rules for target detection are learned using training data. Lin et al. [28] propose a neural network to learn the representation of a DOM node by combining text and markup information. The work in [21] develops a transformer architecture to learn spatial dependency between DOM nodes. Unlike these works, which depend on text information, we aim to learn the representation of a DOM node using only visual cues.
Visual features have been extensively employed to generate visual wrappers for pattern extraction. Mostly, these utilize hand-crafted visual features from a webpage, e.g., area size, font size, and type [4]. Cai et al. [5] develop a visual block tree of a webpage using visual and layout features along with the DOM tree information. Subsequent works use this tree for tasks like webpage segmentation, visual wrapper generation, and web record extraction [3,6,30,45]. Zheng et al. [57] develop a supervised learning framework using predefined visual features to extract content from news webpages. Gogar et al. [15] aim to develop domain-specific wrappers which generalize across unseen templates and don't need manual intervention. They develop a unified model that encodes visual features, textual features, and positional features using a single Convolutional Neural Net (CNN).
Object detection (OD), which aims to detect and classify all objects, has been extensively studied for natural images. Deep learning-based methods such as YOLO [42], R-CNN variants [13,14,18,43], SSD [31], etc. yielded state-of-the-art results in OD. OD methods that can capture contextual information are of particular interest here. Divvala et al. [11] learn contextual information for presence, size and location of other objects. Perko and Leonardis [40] learn a context confidence score for each class by estimating the importance of each pixel. Murphy et al. [38] learn local and global context by object presence and localization and use a product of experts model [19] to combine them. Torralba and Sinha [48] learn global context information in terms of the spatial layout of spectral components. Cinbis and Sclaroff [9] learn the set of descriptors for other objects in the scene and learn intra-class and inter-class spatial relations.
Graph Convolutional Networks (GCN) [25] were proposed to learn a node representation while taking neighbors of a node into account. Using them, Liu et al. [32] represent a visually rich document as a complete graph of text content obtained via OCR [36]. They employ a GCN to learn node representations for each web element.
Recently, attention mechanisms have also shown remarkable ability in capturing contextual information [2]. Vaswani et al. [51] propose a transformer architecture for language modeling. Word vectors learned by BERT [10], which uses self-attention, have yielded state-of-the-art results on 11 NLP tasks. Luo et al. [34] use attention over a BiLSTM-CRF layer for Named Entity Recognition (NER) on biomedical data. Separately, attention has been used for contextual learning in OD [20,26,37] and image captioning [56]. Attention mechanisms have also been employed over graphs to learn an optimal representation of nodes while taking graph structure into account [52].
Moreover, attention makes results interpretable, which is desired in many applications. We show visualizations depicting this advantage below (Sec. 6.5).

Problem formulation
The DOM tree captures the syntactical structure of a webpage similar to a parse tree of a natural language. Our goal is to extract semantic information exploiting this syntactic structure. We view a leaf web element as a word and the webpage as a document with the DOM tree as its underlying parse tree. Formally, we represent a webpage $W$ as the set $W = \{v_1, v_2, \ldots, v_i, \ldots, v_N, D\}$ where $v_i$ denotes the visual representation of the $i$-th web element, $N$ denotes the number of web elements, and $D$ refers to the DOM tree which contains the relations between the web elements. Our goal is to learn a parametric function $f_\theta(y_i \mid W, i)$ which extracts a visual representation $v_i$ of the $i$-th web element from website $W$ so as to accurately predict the label $y_i$ of the web element. In the following we consider four labels, i.e., $y_i \in \{\text{product price}, \text{title}, \text{image}, \text{background}\}$. The parameters $\theta$ are obtained by minimizing the following supervised classification loss
$$\min_{\theta} \; \mathbb{E}_{W \sim P_W} \left[ \sum_{i=1}^{N} \mathrm{CE}(y_i, y_i^{*}) \right],$$
where $\mathbb{E}$ denotes an expectation, $\mathrm{CE}$ is the cross-entropy loss, $y_i$ and $y_i^{*}$ denote the predicted and ground truth labels, and $P_W$ denotes a probability distribution over webpages.
Information of a webpage is present in the leaves of the DOM tree, i.e., the web elements $i$. A web element is an atomic entity which is characterized by a rectangular bounding box. We can extract the target information $y_i$ from the DOM tree if we know the exact leaf bounding boxes of the desired element. Therefore, we can view WIE as an object detection (OD) task where objects are leaf elements. However, the identity $y_i$ of a web element is heavily dependent on its context, e.g., price, title, and image of a product are most likely to be in the same or a nearby sub-tree in comparison to unrelated web elements such as advertisements. Similarly, there can be multiple instances of price-like elements. However, the correct price would be contextually positioned with product title and image (Fig. 2). Therefore, we formulate WIE as a context-aware object detection task.
We use the DOM tree to identify context for a web element. We represent the syntactic closeness between web elements through edges in the graph (discussed in the next section). We then employ a graph attention mechanism [52] to attend to the most important contexts.

Proposed End-to-End Pipeline -CoVA
In this section, we present our Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (CoVA) which aims to learn the function $f_\theta$ to predict labels $y = [y_1, y_2, \ldots, y_N]$ for a webpage. The input to CoVA consists of 1) a screenshot of a webpage, 2) a list of bounding boxes $[x, y, w, h]$ of the web elements, and 3) neighborhood information for each element obtained from the DOM.
As illustrated in Fig. 3, this information is processed by the developed CoVA in four stages: 1) the graph representation extraction for the webpage, 2) the Representation Network (RN), 3) the Graph Attention Network (GAT), and 4) a fully connected (FC) layer. The graph representation extraction computes for every web element $i$ its set of neighboring web elements $N_i$. The RN consists of a Convolutional Neural Net (CNN) and a positional encoder aimed to learn a visual representation $v_i$ for each web element $i \in \{1, \ldots, N\}$. The GAT combines the visual representation $v_i$ of the web element $i$ to be classified and those of its neighbors, i.e., $v_k \; \forall k \in N_i$, to compute the contextual representation $c_i$ for web element $i$. Finally, the visual and contextual representations of the web element are concatenated and passed through the FC layer to obtain the classification output. We describe each of the components next.

Webpage as a Graph
As discussed earlier, the identity of a web element depends on its context. Therefore, we represent a webpage as a graph where nodes are leaf web elements and an edge indicates that the corresponding web elements are contextually relevant to each other. The graph representation of a webpage allows us to learn a contextual representation of a web element by identifying important context, e.g., the currency symbol near a price. An edge within the graph denotes the syntactic closeness in the DOM tree. Specifically, we use the $K$ nearest leaf elements in the DOM tree as the neighbors $N_i$ of a web element $i$.
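The neighbor construction can be made concrete with a small sketch. It assumes that "nearest" means the smallest number of tree edges between two leaves, with ties broken by document order; the text does not spell out the distance or the tie-breaking rule. The `children` dictionary format and the function name `k_nearest_leaves` are hypothetical and chosen only for illustration.

```python
from collections import deque

def k_nearest_leaves(children, root, k):
    """Return {leaf: [k nearest other leaves]} using tree-edge distance (an assumption)."""
    # Build an undirected adjacency list from the parent -> children mapping.
    adj = {root: []}
    for parent, kids in children.items():
        adj.setdefault(parent, [])
        for kid in kids:
            adj.setdefault(kid, [])
            adj[parent].append(kid)
            adj[kid].append(parent)

    # Collect leaves in document (preorder) order for deterministic tie-breaking.
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaves.append(node)
        stack.extend(reversed(kids))
    order = {leaf: idx for idx, leaf in enumerate(leaves)}

    neighbors = {}
    for leaf in leaves:
        # Breadth-first search gives the edge distance from this leaf to every node.
        dist, queue = {leaf: 0}, deque([leaf])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        others = [other for other in leaves if other != leaf]
        others.sort(key=lambda other: (dist[other], order[other]))
        neighbors[leaf] = others[:k]
    return neighbors

# Toy DOM: a product block with title, image and price, plus an unrelated ad.
toy_dom = {
    "body": ["product", "ads"],
    "product": ["title", "image", "buy_box"],
    "buy_box": ["price", "button"],
    "ads": ["banner"],
}
graph = k_nearest_leaves(toy_dom, "body", k=2)
```

In this toy tree the price leaf is linked to the button in its own sub-tree and to the title, while the advertisement banner is never among its two nearest leaves, which is the kind of syntactic closeness the graph is meant to encode.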

Representation Network (RN)
The goal of the Representation Network (RN) is to learn a fixed size visual representation $v_i$ of any web element $i \in \{1, \ldots, N\}$. This is important since web elements have different sizes, aspect ratios, and content types (image or text). To achieve this the RN consists of a CNN operating on the screenshot of a webpage, followed by a Region of Interest (RoI) pooling layer [13] and a positional encoder. Specifically, RoI pooling is performed to obtain a fixed size representation for all web elements. To capture the spatial layout, we learn a $P$-dimensional positional feature which is obtained by passing the bounding box features $[x, y, w, h, \frac{w}{h}]$ through a positional encoder implemented by a single layer neural net. Finally, we concatenate the flattened output of the RoI pooling with the positional features to obtain the visual representation $v_i$.
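To illustrate how the pieces of $v_i$ fit together, the sketch below concatenates a flattened RoI-pooled feature map with encoded positional features. The sizes (64 channels, a 3 × 3 RoI output, $P = 32$) are the hyperparameters reported in the training section; everything else is an assumption made for illustration: the random values stand in for CNN features and learned weights, the encoder is a plain linear layer with no nonlinearity, coordinates are left unnormalized, and all function names are hypothetical.

```python
import random

def positional_features(x, y, w, h):
    """Bounding-box features [x, y, w, h, w/h]; assumes h > 0."""
    return [x, y, w, h, w / h]

def linear_layer(inputs, weights, bias):
    """Single linear layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def visual_representation(roi_pooled, box, weights, bias):
    """Concatenate the flattened RoI output with the encoded positional features."""
    flat = [value for channel in roi_pooled for row in channel for value in row]
    return flat + linear_layer(positional_features(*box), weights, bias)

rng = random.Random(0)
C, H, W_OUT, P = 64, 3, 3, 32  # channels, RoI height, RoI width, positional dims

# Stand-ins for the RoI-pooled CNN features and the learned encoder weights.
roi_pooled = [[[rng.random() for _ in range(W_OUT)] for _ in range(H)]
              for _ in range(C)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(P)]
bias = [0.0] * P

v = visual_representation(roi_pooled, (640.0, 300.0, 80.0, 20.0), weights, bias)
```

Under these sizes every web element, whatever its shape on the page, ends up with a vector of length 64 · 3 · 3 + 32 = 608.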

Graph Attention Network (GAT)
The goal of the graph attention network is to compute a contextual representation $c_i$ for each web element $i$ which takes information from neighboring web elements into account. However, out of multiple neighbors for a web element, only a few are informative, e.g., a web element having a currency symbol near a set of digits seems relevant. To identify the relational importance we use a Graph Attention Network (GAT) [52]. It takes the visual representations $v_i$ of a web element and its neighbors, and computes the contextual representation $c_i$. Formally, let $v = [v_1, v_2, \ldots, v_N]$ represent the visual representations of web elements obtained from the RN. We transform each of the input features by learning projection matrices $W_1$ and $W_2$ applied at every node and its neighbors. We then employ self-attention [29] to compute the importance score, where $\cdot^{T}$ represents transposition, $\|$ is the concatenation operation, and $N_i$ denotes the neighbors of web element $i$. The weights $\alpha_{ij}$ are non-negative attention scores for neighboring web elements of web element $i$. Finally, we obtain the contextual representation $c_i$ for a web element $i$ as a weighted combination of projected visual representations of its neighbors, i.e., via
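A form consistent with the description above and with standard graph attention [52] is $\alpha_{ij} = \mathrm{softmax}_{j \in N_i}\big(a^{T}[W_1 v_i \,\|\, W_2 v_j]\big)$ and $c_i = \sum_{j \in N_i} \alpha_{ij} W_2 v_j$. This is a reconstruction under assumptions, not necessarily the exact formulation used: the attention vector $a$ is a symbol introduced here, and any nonlinearity applied to the score before the softmax (such as the LeakyReLU used in [52]) is omitted. The sketch below implements this assumed form with toy dimensions and random weights.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_representation(v_i, neighbor_vs, W1, W2, a):
    """Assumed form: alpha_ij = softmax_j(a^T [W1 v_i || W2 v_j]), c_i = sum_j alpha_ij W2 v_j."""
    query = matvec(W1, v_i)
    projected = [matvec(W2, v_j) for v_j in neighbor_vs]
    # List concatenation (query + p) plays the role of the || operator.
    scores = [sum(a_k * z_k for a_k, z_k in zip(a, query + p)) for p in projected]
    alpha = softmax(scores)
    c_i = [sum(w * p[d] for w, p in zip(alpha, projected))
           for d in range(len(query))]
    return c_i, alpha

rng = random.Random(0)
in_dim, out_dim, num_neighbors = 6, 4, 3  # toy sizes, not the paper's

def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

W1 = [rand_vec(in_dim) for _ in range(out_dim)]
W2 = [rand_vec(in_dim) for _ in range(out_dim)]
a = rand_vec(2 * out_dim)
v_i = rand_vec(in_dim)
neighbors = [rand_vec(in_dim) for _ in range(num_neighbors)]

c_i, alpha = contextual_representation(v_i, neighbors, W1, W2, a)
```

Because the scores pass through a softmax over $N_i$, the weights are non-negative and sum to one, so $c_i$ is a convex combination of the projected neighbor representations; identical neighbors receive identical weights.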

Augmenting CoVA with extra features
In scenarios where additional features (e.g., text content, HTML tag information, etc.) are available, CoVA can be easily extended to incorporate those. These features can be concatenated with visual representations obtained from the RN without modifying the model in any other way. We refer to this extended model as CoVA++. However, making the model dependent on these features might lead to constraints regarding the programming language (HTML tags) or text language. In Sec. 6.4, we show that CoVA trained on English webpages (without additional features) generalizes well to Chinese webpages. This result suggests that CoVA is able to learn visual representations that are generalizable.

Dataset Generation
To the best of our knowledge there is no large-scale dataset for WIE with visual annotations for object detection. So far, the Structured Web Data Extraction (SWDE) dataset [16] is the only known large dataset that can be used for training deep neural networks for WIE [28,33]. The SWDE dataset contains webpage HTML code, which is not sufficient to render a page into a screenshot (since it contains links to old and non-existent URLs). Because of this we create a new large-scale labeled dataset for object detection on product webpage screenshots. We chose e-commerce websites since those have been a de-facto standard for WIE [15,59]. Our dataset generation consists of two steps: 1) search the web with 'shopping' keywords to aggregate diverse webpages and employ heuristics to automate labeling of product price, title, and image, and 2) manual correction of incorrect labels. We discuss both steps next.
Web scraping and coarse labeling. To scrape websites we use Google shopping, which aggregates links to multiple online retailers (domains) for the same product. These links are uploaded by the merchants of the respective domains. We do a keyword search for various categories, like 'electronics,' 'food,' 'cosmetics.' For each search result, we record the price and title from Google shopping. Then, we navigate through the links to specific product websites and save a 1280 × 1280 screenshot. To extract a bounding box for each web element, we store a pruned DOM tree. Price and title candidates are labeled by comparing with the recorded values using heuristics. For product images, we always choose the DOM element having the largest bounding box area among all the elements with an <img> HTML tag, although this choice is not correct for many websites. We correct this issue in the next step.
Label correction. The coarse labeling is only ∼60% accurate because 1) the price on a webpage keeps changing and might differ from the Google shopping price, and 2) many bounding boxes have the same content. To correct for these mistakes, we manually inspected and corrected labeling errors. In many cases, product price or product title is present multiple times in a webpage, so we made our best effort to choose the best one given its context. We obtain 7,740 webpages spanning 408 domains. Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are ∼90 web elements on a webpage.
Train-Val-Test split. We create a cross-domain split which ensures that each of the train, val and test sets contains webpages from different domains. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, eBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-fold cross validation such that each of the major domains is present in one of the 5 splits for test data.
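The cross-domain split can be sketched as follows. This is one possible construction, not necessarily the procedure used to build the dataset: it partitions the domains into five disjoint chunks with one major domain per chunk and rotates which chunk serves as test and validation. How the remaining domains are shuffled and which chunk becomes the validation set are assumptions, and the domain names `shop0` … `shop19` and the function name `cross_domain_folds` are hypothetical.

```python
import random

def cross_domain_folds(domains, major_domains, seed=0):
    """Return one (train, val, test) triple of domain lists per major domain."""
    n_folds = len(major_domains)
    rest = sorted(set(domains) - set(major_domains))
    random.Random(seed).shuffle(rest)
    # Disjoint chunks of domains, each containing exactly one major domain.
    chunks = [[major] + rest[i::n_folds] for i, major in enumerate(major_domains)]
    folds = []
    for f in range(n_folds):
        v = (f + 1) % n_folds  # the next chunk serves as validation (an assumption)
        train = [d for j, chunk in enumerate(chunks) if j not in (f, v) for d in chunk]
        folds.append((train, chunks[v], chunks[f]))
    return folds

majors = ["amazon", "ebay", "walmart", "etsy", "target"]
all_domains = majors + [f"shop{i}" for i in range(20)]
folds = cross_domain_folds(all_domains, majors)
```

With 25 toy domains each fold has 15 training, 5 validation and 5 test domains, i.e., a 3 : 1 : 1 ratio by domain count. With 408 domains the ratio can only hold approximately, since 408 is not divisible by 5. Webpages are then assigned to a set according to their domain, so no domain appears in more than one set within a fold.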

Experimental Setup & Results
In this section, we present our experimental setup, evaluation metrics, comparison of results with the baselines, and attention visualizations of our model.

Baseline Methods
We compare the results of our end-to-end pipeline CoVA with other existing and newly created baselines summarized below. Our newly created baselines combine existing object detection and graph based models to identify the importance of visual features and contextual representations.
Gogar et al. [15]: This method identifies product price, title, and image from the visual and textual representation of the web elements. We use their publicly available code to train it on our dataset.
Random Forest on Heuristic features: We train a Random Forest classifier with 100 trees using various heuristic HTML tag-based, text, and bounding box features. HTML tags like <H1>, <P>, <IMG>, etc. are one-hot encoded. Textual features include font size, number of words, and binary features like presence of currency symbols, text, and number. Bounding box features $[x, y, w, h, \frac{w}{h}]$ for the web element are also used.
Fast R-CNN*: We compare with Fast R-CNN [13] to quantify the importance of contextual representations in CoVA. We use the DOM tree instead of selective search [49] for bounding box proposals. Since the proposals are exactly localized on the webpage, there is no need for bounding box regression. We also use positional features as described when discussing the representation network (Sec. 4.2) for a fair comparison with CoVA. We will refer to this baseline as 'Fast R-CNN*.'
Fast R-CNN* + GCN [25]: We use graph convolution networks on our graph formulation where node features are the visual representations obtained from Fast R-CNN*.
Fast R-CNN* + Bi-LSTM [44]: We train a bidirectional LSTM on visual representations of web elements in preorder traversal of the DOM tree. We use its output as the contextual representation and concatenate it with the visual representation of the web element obtained from Fast R-CNN*.
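As a rough illustration of the heuristic feature vector used by the Random Forest baseline, the sketch below assembles the three feature groups named above for one web element. The tag vocabulary, the set of currency symbols, the feature order, and the function name `heuristic_features` are all assumptions made for this example; the exact feature set of the baseline is not specified beyond the description in the text.

```python
TAGS = ["H1", "P", "IMG", "SPAN", "DIV"]  # hypothetical tag vocabulary
CURRENCY_SYMBOLS = "$€£¥₹"                # hypothetical symbol set

def heuristic_features(tag, text, font_size, box):
    """Build [one-hot tag | textual features | bounding-box features] for one element."""
    x, y, w, h = box
    one_hot = [1.0 if tag.upper() == t else 0.0 for t in TAGS]
    textual = [
        float(font_size),
        float(len(text.split())),                                  # number of words
        1.0 if any(ch in CURRENCY_SYMBOLS for ch in text) else 0.0,  # has currency symbol
        1.0 if text.strip() else 0.0,                              # has any text
        1.0 if any(ch.isdigit() for ch in text) else 0.0,          # has a number
    ]
    return one_hot + textual + [x, y, w, h, w / h]  # assumes h > 0

features = heuristic_features("span", "$19.99", 24, (640.0, 300.0, 80.0, 20.0))
```

Vectors of this kind, one per web element, could then be passed to any random forest implementation configured with 100 trees.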

Model Training, Inference and Evaluation
In each training epoch, we randomly sample 90% of the background (neither price, title, nor image) web elements. This increases the diversity in training data by providing different contexts for webpages with exactly the same template. Overall, it introduces stochasticity and reduces overfitting for contextual learning while decreasing the number of computations. We use batch normalization [23] between consecutive layers, which improves convergence and final performance. We train the model for a maximum of 50 epochs with early stopping, after which we restore model parameters to the epoch corresponding to the best validation data result. We use the Adam optimizer for updating model parameters and minimize the cross-entropy loss. During inference, the model detects the one web element with the highest probability for each class. Once the web element is identified, the corresponding text content can be extracted from the DOM tree or by using OCR for downstream tasks.
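The per-epoch background sampling can be sketched as below. The sketch assumes that sampling is done independently per webpage and per epoch and that all labeled target elements are always kept; the label strings and the function name `sample_training_elements` are hypothetical.

```python
import random

def sample_training_elements(labels, keep_fraction=0.9, rng=None):
    """Return indices of all target elements plus a random subset of background ones."""
    rng = rng or random.Random()
    targets = [i for i, label in enumerate(labels) if label != "background"]
    background = [i for i, label in enumerate(labels) if label == "background"]
    n_keep = round(keep_fraction * len(background))
    kept = rng.sample(background, n_keep)  # sampling without replacement
    return sorted(targets + kept)

# A toy page with about 90 elements, close to the dataset average.
labels = ["price", "title", "image"] + ["background"] * 87
epoch_indices = sample_training_elements(labels, rng=random.Random(0))
```

Calling the function again in the next epoch with a different random state yields a different subset, so the same page template is seen with varying contexts across epochs.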
For CoVA++ we use as additional information the same heuristic features used to train the Random Forest classifier baseline. Unless specified otherwise, all results of CoVA and baselines use the following hyperparameters where applicable: learning rate = 5e-4, batch size = 5 screenshot images, $K$ = 24 neighbor elements in the graph, RoI pool output size ($H \times W$) = (3 × 3), dropout = 0.2, $P$ = 32 dimensional positional features, output dimension of the projection matrices $W_1$, $W_2$ = 384, weight decay = 1e-3. We use the first 5 layers of a pre-trained ResNet18 [17] in the representation network (RN), which yields a 64 channel feature map. This significantly reduces the parameters in the RN from 12M to 0.2M and speeds up training at the same time. The evaluation is performed using Cross-domain Accuracy for each class, i.e., the fraction of webpages of new domains with correct class. All the experiments are performed on Tesla V100-SXM2-16GB GPUs.
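The inference rule and the per-class accuracy can be sketched together. The data layout is an assumption made for illustration: each page is a list of per-element probability dictionaries together with the index of the ground truth element for each class, and the function names `predict_page` and `class_accuracy` are hypothetical.

```python
CLASSES = ("price", "title", "image")

def predict_page(probabilities, classes=CLASSES):
    """For each class, pick the index of the element with the highest probability."""
    return {
        c: max(range(len(probabilities)), key=lambda i: probabilities[i][c])
        for c in classes
    }

def class_accuracy(pages, classes=CLASSES):
    """Fraction of pages on which the selected element matches the ground truth."""
    correct = {c: 0 for c in classes}
    for probabilities, truth in pages:
        predicted = predict_page(probabilities, classes)
        for c in classes:
            correct[c] += int(predicted[c] == truth[c])
    return {c: correct[c] / len(pages) for c in classes}

# Two toy test pages with three elements each (probabilities are made up).
page_1 = (
    [{"price": 0.10, "title": 0.80, "image": 0.05},
     {"price": 0.70, "title": 0.10, "image": 0.10},
     {"price": 0.20, "title": 0.05, "image": 0.90}],
    {"price": 1, "title": 0, "image": 2},
)
page_2 = (
    [{"price": 0.60, "title": 0.30, "image": 0.10},
     {"price": 0.50, "title": 0.60, "image": 0.20},
     {"price": 0.10, "title": 0.10, "image": 0.70}],
    {"price": 1, "title": 1, "image": 2},
)
accuracy = class_accuracy([page_1, page_2])
```

On the second toy page the model prefers a price-like element that is not the labeled price, so price accuracy is 0.5 while title and image accuracy are 1.0. For the cross-domain metric, the pages passed to `class_accuracy` would come only from domains unseen during training.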

Results
As shown in Table 1, our method outperforms all baselines by a considerable margin, especially for price prediction. CoVA learns visual features which are significantly better than the heuristic feature baseline that uses predefined tag, textual and visual features. Fig. 4 shows the importance of different heuristic based features in a webpage. We observe that a heuristic feature based method has similar performance to methods which don't use contextual features. Moreover, CoVA++, which also uses heuristic features, doesn't lead to significant improvements. This shows that visual features learned by CoVA are more general for tasks like price and title detection. Context information is particularly important for price (in comparison to title and image) since it's highly ambiguous and occurs in different locations with varying contexts (Fig. 2). This is evident from the ∼8.9% improvement in price accuracy compared to Fast R-CNN*. Unless stated otherwise, we will discuss results with respect to price accuracy. We observe that CoVA yields stable results across folds (∼3.5% reduction in standard deviation). This shows that CoVA learns features which are generalizable and which have less dependence on the training data. Using GCN with Fast R-CNN* leads to unstable results with 11% standard deviation while yielding a 3.4% improvement over Fast R-CNN*. Fast R-CNN* with Bi-LSTM is able to summarize the contextual features, yielding a ∼6.3% improvement in comparison to Fast R-CNN*. CoVA outperforms Fast R-CNN* with Bi-LSTM by ∼2.6% with far fewer parameters while also yielding interpretable results. We also computed the top-3 accuracy for CoVA, which is 98.6%, 99.4%, and 99.9% for price, title and image respectively.

Cross-lingual Evaluation of CoVA
To validate our claim that visual features (without textual or HTML tag information) can capture cross-lingual information, we test our model on webpages in a foreign language. In particular, we evaluated CoVA (trained on English product webpages) using 100 Chinese product webpages spanning 25 unique domains. CoVA achieves 92%, 90%, and 99% accuracy for product price, title, and image.

Attention Visualizations
Table 1 shows that attention significantly improves performance for all three targets. As discussed earlier, only a few of the contexts are important, and these are effectively identified by our Graph Attention Network (GAT). We observed that on average, ∼20% of context elements were activated (score above a 0.05 threshold) by the GAT. We also studied multi-head attention instead of a single head, following [51], which didn't yield significant improvements in our case.
Fig. 5 shows visualizations of attention scores learned by the GAT. Fig. 5(a) shows an example where title and image have more weight than other contexts when learning a context representation for price. This shows that attention is able to focus on important web elements and discards others. Similarly, Fig. 5(b) shows that price has a much higher score than other contexts for learning the contextual representation for title. We found that there are some cases where attention gives similar weights to all contexts.

Ablation Studies
Importance of Positional features. We train CoVA without positional features to gauge their importance. Table 2 shows that positional features can significantly improve accuracy for price, title, and image prediction. This also validates that for webpage object detection, the location and size of a bounding box carry significant information, making it different from classical object detection.
Dependence on number of Neighbors in Graph. Fig. 6 shows the variation in cross domain accuracy of CoVA with respect to the number of neighboring elements $K$. Note that having 0 context elements is equivalent to our baseline Fast R-CNN*. We observe that, unlike for title and image, price accuracy can significantly be improved by considering larger contexts. This is due to the fact that price is highly ambiguous (Fig. 2). We also study the graph construction described by [32] where all nodes are considered in the neighborhood of a particular node. This significantly reduced the performance for price (90.7%) and title (92.7%).

Importance of Sampling in training. As discussed in Section 6, we introduce a random sampling of 90% of the background web elements in each training epoch. Table 3 shows that this leads to improvements in results.

Conclusion & Future Work
In this paper, we reformulated the problem of webpage IE (WIE) as a context-aware webpage object detection task. We created a large-scale dataset for this task, which we will release publicly. We proposed CoVA which uses i) a graph representation of a webpage, ii) a Representation Network (RN) to learn a visual representation for a web element, and iii) a Graph Attention Network (GAT) for contextual learning. CoVA improves upon state-of-the-art results and newly created baselines by considerable margins. Our visualizations show that CoVA is able to attend to the most important contexts. In the future, we would like to adapt this method to other tasks such as identifying malicious web elements.

Figure 1. A person can detect the web element for product price, title, and image, even without knowing (a) Arabic or (b) Chinese.

Figure 2. Example webpage showing multiple possible prices (red), but relatively fewer possible titles (green) or images (purple).

Figure 3. CoVA end-to-end training pipeline (for a single web element). CoVA takes a webpage screenshot and a list of bounding boxes along with $K$ neighbors for each web element (obtained from the DOM). The RN learns the visual representation ($v_0$) while the GAT learns the contextual representation ($c_0$) from its neighbors' visual representations. $v_0$ and $c_0$ are concatenated and passed through the FC layer.

Figure 4. Gini impurity-based importance of features in RF.

Figure 5. Attention visualizations where the red border denotes the web element to be classified, and its contexts have a green shade whose intensity denotes the score. Price in (a) gets a much higher score than other contexts. Title and image in (b) are scored higher than other contexts for price.

Table 2.
Method | Price Accuracy | Title Accuracy | Image Accuracy
CoVA without positional features | 89.2 ± 10.3 | 91.9 ± 1.4 | 95.9 ± 1.8
CoVA | 95.5 ± 3.8 | 95.7 ± 1.2 | 98.8 ± 1.5

Figure 6. Comparison of context size with accuracy.

Table 1. Cross Domain Accuracy (mean ± standard deviation) for 5-fold cross validation. Columns: Method, No. of parameters, Price Accuracy, Title Accuracy, Image Accuracy.

Table 3. Improvement in performance due to sampling.