WebDP: Understanding Discourse Structures in Semi-Structured Web Documents



Introduction
With the rapid development of the internet over the past decades, web documents have become one of the largest and most important data resources of the current era. As a result, understanding their discourse structure, i.e., how different components in a document semantically interact with each other to form a cognitive whole, will greatly benefit many downstream applications, as previous work has demonstrated the value of structure information in other types of documents (Chen et al., 2020; Geva and Berant, 2018; Xing et al., 2022; Frermann and Klementiev, 2019; Zhang et al., 2020).
The free-styled, semi-structured nature of web documents gives them characteristics different from traditional forms of documents, providing abundant opportunities and challenges for discourse research. On one hand, web documents exhibit more free-styled discourse organization. For instance, it is common for web documents to encompass multiple topics (Tsujimoto and Asada, 1990), where different blocks (see footnote 2) within the document are loosely connected by implicit semantic relevance or even describe independent topics (Figure 1). The free-styled nature allows discourse with loose structures, multiple topics and weak coherence, in contrast to the compactly structured, single-topic and strongly coherent nature of traditional documents. On the other hand, web documents are semi-structured with respect to their HTML markup and block layout. The content of web documents tends to be organized under multiple hierarchies, usually reflected by the HTML markup hierarchies (Shinzato and Torisawa, 2004; Yoshida and Nakagawa, 2005). However, the semi-structured information provided by HTML markup and layout structures is merely superficial, since it does not consistently align with the underlying semantic relations in the discourse structure (Figure 1); an analogous phenomenon has also been found in plain texts (van der Vliet and Redeker, 2011). In fact, the inconsistency between these superficial structures and the underlying discourse structure is prevalent in web documents due to their free-styled writing process.
Footnote 2: "Blocks" (Tsujimoto and Asada, 1990), or more formally "physical objects" (Cao et al., 2022; Mao et al., 2003), are customary concepts in previous document image processing research. Tsujimoto and Asada (1990) defined "blocks" as "a set of text lines with the same typeface font and a constant line interval" within a document. In this paper, we study discourse connections among "blocks" in web documents and use this term as a short name for Elementary Discourse Units in the proposed web document discourse schema. A formal definition of these units is presented in § 3.4.
Unfortunately, previous research cannot fully meet the requirements for discourse representation imposed by the free-styled, semi-structured characteristics of web documents. This limitation applies to two distinct lines of research: Discourse Analysis (Li et al., 2022a) and Document Intelligence (Cui et al., 2021). In discourse analysis, traditional studies focus on classical plain texts with compact, single-topic and strongly coherent discourse, leaving a gap to free-styled web documents with loose discourse organization, multiple topics and weak coherence. Furthermore, they mainly study clause-level discourse phenomena rather than blocks in semi-structured web documents. In document intelligence, although studies recognize various kinds of semi-structured and multi-modal information carried in documents, including visual, layout, and HTML markup features, most only examine superficial structures directly derived from these multi-modal features. As demonstrated in Figure 1, the inconsistency between superficial structures and the underlying discourse structure in web documents limits the usefulness of this line of work in downstream applications that require a deep understanding of semantics.
In this paper, we introduce a new benchmark, WebDP, based on a discourse representation schema for web documents named WebDP structure. The aim is to promote studies on web document discourse analysis and to bridge the gap between the underlying discourse structure and the superficial layout structure of documents. To accurately represent the free-styled discourse structure of web documents, we design the discourse schema by extending the linguistic theories of Rhetorical Relations (Mann and Thompson, 1988) and SDRT (Lascarides and Asher, 2008) to take into account characteristics of discourse organization found in web documents. Specifically, WebDP structure adopts two kinds of rhetorical relations to connect blocks in a document and form a hierarchically structured discourse representation: subordinating relations depict semantic functions between blocks across different levels of the hierarchy, and coordinating relations depict coherence properties between blocks at the same level of the hierarchy. We then collect web documents and manually annotate a dataset, WEBDOCS, to support further research on WebDP. Additionally, to verify the feasibility of the proposed task and reveal its challenges to current methods, we re-implement representative neural models from related fields and conduct systematic experimental analyses on the dataset.
The contributions of this work can be summarized as follows: • We formulate a web document discourse schema to explicitly model discourse structures within web documents (§ 3).

Related Work
Previous research approaches the structures within documents from two orthogonal lines, i.e., discourse analysis (Li et al., 2022a) and document intelligence (Cui et al., 2021). Our proposed schema, WebDP structure, complements them in modeling the discourse of web documents (see Table 1 for a brief summary).

Discourse Analysis
Traditional research on discourse analysis has comprehensively studied discourse structure representation based on various discourse theories (Fu, 2022). Flat (Prasad et al., 2008; Zhou and Xue, 2012; Xue et al., 2015), tree (Carlson et al., 2001; Li et al., 2014; Yoshida et al., 2014; Jiang et al., 2018) and graph (Wolf and Gibson, 2005) structures have been exploited in discourse schemas. Despite their comprehensive discussion of theoretical foundations, traditional studies are mainly confined to short, single-topic plain text and model discourse relations at the clause level; they therefore cannot be directly borrowed as solutions for modeling long, multi-topic, semi-structured web documents at the block level.
Other studies pay attention to genre-specific discourse structure, for example, Dialogue Discourse Parsing (Afantenos et al., 2015; Asher et al., 2016; Li et al., 2020) for modeling discourse structures of multi-turn conversation, and the discourse structure of technical web forums (Wang et al., 2011), news articles (Choubey et al., 2020) and long-form answers (Xu et al., 2022). In this paper, we follow a similar motivation and formulate the discourse structure of web documents by modeling their genre-specific characteristics.
Another line of work closely related to discourse analysis is Text Segmentation (Choi, 2000; Purver, 2011), which aims to segment long text into shorter, topically coherent segments. While text segmentation can be seen as parsing a shallow, linear structure, our proposed schema provides richer semantic representation and accounts for topic transitions at various hierarchical levels, as suggested by Eisenstein (2009).

Document Intelligence
Document Logical Hierarchy Extraction (or Document Logical Structure Analysis) (Tsujimoto and Asada, 1990; Summers, 1998; Mao et al., 2003; Pembe and Güngör, 2015; Manabe and Tajima, 2015; Rahman and Finin, 2017; Cao et al., 2022) and Table of Content Extraction (Maarouf et al., 2021) aim at parsing a document to produce a hierarchical structure based on layout relationships between physical blocks. As mentioned before, the layout relationships in web documents do not imply consistent semantic relation types. Therefore, the underlying discourse structure of web documents cannot be faithfully represented by these task settings. Wang et al. (2020) proposed Form Understanding, which models the latent hierarchy in forms with a single "key-value" relation. Similarly, Hwang et al. (2021) model information extraction from semi-structured document images as Spatial Dependency Parsing between single tokens. Unlike these approaches, which are dedicated to forms and other short document images, our work aims at modeling the discourse structure of whole, long documents and describes richer categories of semantic relations.

Designing Discourse Representation Schema for Web Documents
In this section, we describe the proposed web document discourse schema, WebDP structure. We first summarize the main characteristics of web documents and the consequent requirements for a well-designed discourse schema (§ 3.1). Then we briefly review the Rhetorical Relations (Mann and Thompson, 1988) and SDRT (Lascarides and Asher, 2008) theories and discuss their merits for modeling web document discourse (§ 3.2). Finally, we extend these theories by summarizing semantic relation labels specific to web documents and propose the new web document discourse schema (§ 3.3 and § 3.4).

Characteristics of web document discourse structure
As a special genre of text, web documents have free-styled discourse organization and a semi-structured data format, which give them unique properties and mean that traditional task settings offer no direct solution to their discourse structure. Specifically, we focus on two prominent characteristics caused by the nature of web documents: multiple hierarchies, caused by the semi-structured data format, and multiple topics, caused by the free-styled discourse organization. We summarized these characteristics through a preliminary case study on web document instances and use them to calibrate the discourse schema.
Multiple Hierarchies. Documents are intentionally designed to have various levels of information packaging, where some blocks subordinate others semantically or pragmatically to form hierarchical structures. Although such semantic hierarchies are usually indicated by semi-structured layout and markup features (Power et al., 2003), inconsistency between these superficial structures and the underlying semantic hierarchy is common. Therefore, to accurately represent the discourse structure of web documents, a schema should account for the hierarchical structure and discriminate the different underlying semantic functions between blocks in the hierarchy.
Multiple Topics. It is common for web documents to contain multiple topics that are semantically incoherent with each other (Tsujimoto and Asada, 1990). Therefore, besides the hierarchical structure realized by various semantic functions between different hierarchy levels, a discourse schema for web documents should also explicitly model topic transitions and semantic incoherence between blocks on the same hierarchy level. Furthermore, due to the free-styled, user-generated writing process, web documents can be noisy. To cover a wide range of web document instances and remain robust against noise, the discourse schema should strike a balance between expressive capability and universality, which requires it to be concise.
To meet these requirements, we draw on the classical discourse linguistic theories of Rhetorical Relations (RR, Mann and Thompson, 1988; see Jasinskaja and Karagjosova, 2015 for an integrated review) and Segmented Discourse Representation Theory (SDRT, Lascarides and Asher, 2008). In the following, we first briefly review the key points of these theories on which we build. Then we formulate WebDP structure based on them.

Rhetorical Relation Theory
Mann and Thompson (1988) first suggested that the coherence of a text is realized by functions that connect its different parts. They called these functions Rhetorical Relations (RRs). Mann and Thompson (1988) also suggested that the list of RRs is potentially open to the addition of new relations, for the sake of describing the discourse structure of particular texts. We assume that different blocks in a web document are also connected by RRs that realize the coherence of the document when it is read.
Further, two kinds of RRs have been distinguished (Asher and Vieu, 2005): Subordinating RRs and Coordinating RRs. 1) Subordinating RRs, such as Elaboration and Explanation, exist between units with unequal information packaging levels, where one is subordinate to the other. 2) Coordinating RRs, such as Narration, Parallel and Contrast, exist between units of the same information packaging level. SDRT (Asher, 1993; Asher and Lascarides, 2003; Lascarides and Asher, 2008) adopts this distinction between the two types of RRs and considers that they govern the hierarchical structure of discourse.
The distinction between subordinating and coordinating RRs provides a suitable theoretical foundation for modeling the discourse structure of web documents. On one hand, the various types of semantic relations between multiple hierarchies can be modeled by subordinating RRs, which describe dominating relations between units on unequal information packaging levels. On the other hand, the multiple-topics characteristic and the different semantic relations between same-level units can be modeled by coordinating RRs. Therefore, we design a web document discourse schema based on RR theory.

Overview of WebDP Structure
Based on RR theory, we propose a new discourse schema for modeling web document discourse structure. Specifically, WebDP structure is composed of the elementary discourse units of a document (i.e., blocks in a web document) and binary rhetorical relations between them. Two kinds of rhetorical relations are considered in the structure: 1) subordinating relations, which are analogous to the parent-child relation in a tree, and 2) coordinating relations, which are analogous to the relation between successive sibling nodes in a tree. Note that subordinating relations in the proposed discourse structure are not necessarily consistent with the layout structure of web documents.
We denote the WebDP structure as $G^{(d)} = \{V^{(d)}, E^{(d)}\}$, a sparse graph consisting of a node set $V^{(d)}$ and an edge set $E^{(d)}$. Each edge $r_k$ in $E^{(d)}$ indicates a binary relation between a pair of nodes and is attributed with its rhetorical relation label. We further add constraints on the resulting structures to restrict their complexity. 1) Subordinating edges are directed, since subordinating relations are semantically asymmetric. We define the direction of a subordinating edge as pointing from the subordinated (lower-level in the hierarchy) node to the dominating (upper-level in the hierarchy) node, and stipulate that each node has at most 1 outgoing subordinating edge, while the number of incoming subordinating edges is unconstrained. 2) Coordinating edges are undirected, since coordinating relations are symmetric. We stipulate that each node can be linked by at most 2 coordinating edges and that the coordinating edges alone form no cycle.
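The structural constraints above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `check_webdp_constraints`, the integer node ids `0..num_nodes-1` and the edge tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict


def check_webdp_constraints(num_nodes, sub_edges, coord_edges):
    """Return a list of constraint violations (empty if the structure is valid).

    Hypothetical encoding (an assumption, not from the paper):
      sub_edges:   list of (child, parent, label), directed child -> parent
      coord_edges: list of (u, v, label), undirected
    """
    problems = []

    # Constraint 1: each node has at most one outgoing subordinating edge.
    out_degree = defaultdict(int)
    for child, _parent, _label in sub_edges:
        out_degree[child] += 1
    for node, degree in sorted(out_degree.items()):
        if degree > 1:
            problems.append(f"node {node} has {degree} outgoing subordinating edges")

    # Constraint 2a: each node is linked by at most two coordinating edges.
    coord_degree = defaultdict(int)
    for u, v, _label in coord_edges:
        coord_degree[u] += 1
        coord_degree[v] += 1
    for node, degree in sorted(coord_degree.items()):
        if degree > 2:
            problems.append(f"node {node} has {degree} coordinating edges")

    # Constraint 2b: coordinating edges alone form no cycle (union-find).
    root = list(range(num_nodes))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for u, v, _label in coord_edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            problems.append(f"coordinating edge ({u}, {v}) closes a cycle")
        else:
            root[root_u] = root_v

    return problems
```

For example, two children of node 0 linked by one LIST edge pass the check, whereas three nodes linked pairwise by coordinating edges are reported as a cycle.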
In the following, we instantiate the schema by specifying the composition of the node set and the rhetorical relations in the two edge subsets.

Elements in WebDP Structure
Nodes in Discourse Schema. Nodes are the elementary discourse units (EDUs) of a discourse structure when it is represented as a graph. While traditional discourse parsing usually takes clauses as elementary units, we follow document logical hierarchy extraction studies (Cao et al., 2022; Tsujimoto and Asada, 1990; Mao et al., 2003) and model semantic relations between physical objects, or blocks, in a document. Tsujimoto and Asada (1990) defined blocks as "a set of text lines with the same typeface font and a constant line interval" within a document. Summers (1998) suggested that the segments of a document in a logical structure should be a "visually distinguished semantic component" of the document, emphasizing both a layout requirement and a content requirement.
Following these previous concepts, we formalize "blocks" in WebDP structure as non-overlapping physical objects (regions with a bounding box) in documents that satisfy two criteria: 1) Layout Homogeneity, requiring that blocks be visually distinguishable and have clear layout boundaries separating them from other parts of the document. 2) Semantic Coherence, requiring that blocks have internal semantic coherence in their content. In practice, "blocks" in web document data are usually paragraph-level units. For example, headings, paragraphs, items in bullet lists and figure captions are common kinds. Blocks serve as EDUs and constitute the node set $V^{(d)}$ in our web document discourse schema.
Subordinating Relation Set. We define 7 types of subordinating relations, namely ELABORATION, EXPLANATION, TOPIC&TAG, ATTRIBUTE, LITERATURE, CAPTION and PLACEHOLDER. While the first four relation types also appear in previous discourse schemas, we include three additional subordinating relations based on data-driven analysis. These additions aim to address specific cases unique to web documents that cannot be adequately represented by the previously defined relations. Also note that in our schema, the subordinating EDUs are blocks with more concise and succinct information, occupying a higher information packaging level, which differs slightly from the definition of "nucleus" (EDUs with more significant information) in RST-style discourse schemas. Detailed interpretations of each relation type can be found in Appendix A.1.
Coordinating Relation Set. We define 5 types of coordinating relations, namely NARRATION, LIST, PARALLEL&CONTRAST, TOPIC_CORRELATION and BREAK. These coordinating relations are designed to model various degrees of coherence between EDUs on the same information packaging level within long documents, forming a spectrum from tight to loose semantic coherence. Among them, LIST is a discourse expression pattern more common in documents than in plain texts, and BREAK is highly frequent in free-styled web documents. Detailed interpretations of each relation type can be found in Appendix A.2.
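For readers who want to work with the two label sets programmatically, they can be written down as enumerations. This is only an illustrative encoding: the class names, the member names `TOPIC_TAG` and `PARALLEL_CONTRAST`, and the helper `label_to_id` are hypothetical, while the string values are the labels listed above.

```python
from enum import Enum


class SubordinatingRelation(Enum):
    """The 7 subordinating relation labels of the schema."""
    ELABORATION = "ELABORATION"
    EXPLANATION = "EXPLANATION"
    TOPIC_TAG = "TOPIC&TAG"            # "&" is not valid in a Python identifier
    ATTRIBUTE = "ATTRIBUTE"
    LITERATURE = "LITERATURE"
    CAPTION = "CAPTION"
    PLACEHOLDER = "PLACEHOLDER"


class CoordinatingRelation(Enum):
    """The 5 coordinating relation labels of the schema."""
    NARRATION = "NARRATION"
    LIST = "LIST"
    PARALLEL_CONTRAST = "PARALLEL&CONTRAST"
    TOPIC_CORRELATION = "TOPIC_CORRELATION"
    BREAK = "BREAK"


def label_to_id(relation_enum):
    """Map each label string to an integer class id (definition order)."""
    return {member.value: index for index, member in enumerate(relation_enum)}
```

A mapping such as `label_to_id(CoordinatingRelation)` could serve as the class vocabulary of a relation classifier; the id order here simply follows the order in which the labels are listed and carries no meaning.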

WebDP: A new benchmark for Web Document Discourse Parsing
To facilitate research on web document discourse analysis, we present a new task named Web document Discourse Parsing (WebDP). The goal of the task is to automatically convert the linear list of EDUs in an input web document into the hierarchical discourse structure defined by WebDP structure.
To benchmark the task, we construct a new dataset, WEBDOCS, and introduce its evaluation metrics. By virtue of WEBDOCS, we can benchmark further studies on WebDP, reveal the key challenges of web document discourse analysis, and assess the effectiveness and diagnose the defects of different parsing algorithms.

Data Collection
A well-designed dataset should widely cover the main characteristics of web documents analysed above. With this consideration, we choose WeChat Official Account (footnote 3) as the data source for the following reasons. First, web documents on WeChat Official Account are highly free-styled, as they are generated by a multitude of individual authors who contribute content independently rather than adhering to a standardized editing norm. Specifically, we find that these web documents show salient multiple-topics and multiple-hierarchies phenomena and frequently exhibit inconsistency between the superficial visual structure and the underlying discourse structure. Second, WeChat Official Account has a broad and active user community. This makes a dataset based on it universal in domain, extensible in scale and of practical value.
To obtain the content of each EDU in web documents, we crawl the HTML source code from web sites and use a naive rule-based parsing script to extract the textual content of each HTML element along with its XPath information.
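The extraction script itself is not described further. As a rough illustration of what such a rule-based step might look like, the sketch below uses Python's standard `html.parser` to pair each text node with a positional XPath-like string. The class `XPathTextExtractor`, the function `extract_text_with_xpath` and the `tag[index]` path format are assumptions, and the sketch expects well-nested HTML.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are counted but not opened.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # their content is not document text


class XPathTextExtractor(HTMLParser):
    """Collect (xpath, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.path = []            # "tag[index]" steps of the currently open elements
        self.child_counts = [{}]  # per open element: tag -> children seen so far
        self.records = []         # extracted (xpath, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.path.append(f"{tag}[{counts[tag]}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.path and self.path[-1].split("[")[0] == tag:
            self.path.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.path:
            return
        if self.path[-1].split("[")[0] in SKIPPED_TAGS:
            return
        self.records.append(("/" + "/".join(self.path), text))


def extract_text_with_xpath(html):
    parser = XPathTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.records
```

Real pages with unclosed or misnested tags would need a more tolerant parser; the point here is only the pairing of text content with a path that identifies its HTML element.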

Data Annotation
In the annotation stage, two human annotators were employed to produce gold WebDP discourse structure annotations for web documents. Both annotators are native speakers of Chinese (the language of the data) and hold bachelor's degrees in education. We recruited them through advertising on an internal institution forum. Based on the number of applicants, we then set a target of 300 web documents in total, with each annotator assigned 180 documents, 60 of which were randomly sampled to be annotated by both annotators in order to measure annotation consistency (Human Baseline in Table 2).
In the formal annotation phase, annotators were asked to label WebDP structures for web documents in 3 steps: 1) Annotate EDUs. Annotators identified blocks of text and aligned their content with XPath information (footnote 4). All EDUs within a web document were organized in a list $V^{(d)} = \{e_1, \dots, e_{|d|}\}$ arranged in the correct reading order. 2) Annotate subordinating relations. Annotators iterated over $V^{(d)}$ and considered each EDU in turn. For the EDU $e_i$ being considered, annotators assigned it the most appropriate subordinating EDU preceding it ($e_j$ with $j < i$) based on the subordinating relation set. If no suitable subordinating EDU could be found, a dummy node was marked. 3) Annotate coordinating relations. For the EDU $e_i$ being considered, annotators identified the EDUs preceding $e_i$ that shared the same subordinating EDU with $e_i$. From these EDUs, they selected the one most semantically coherent with $e_i$ (usually the nearest one) and then chose the coordinating relation between them from the coordinating relation set.
In the experiments below, we split the 300 annotated documents into 200/50/50 to serve as the train/dev/test sets, respectively. See Appendix B for details of data annotation and statistics.

Evaluation Metrics
Similar to previous dependency parsing tasks (Nivre and Fang, 2017), we use UAS and LAS to evaluate WebDP. These metrics calculate the percentage of correctly predicted edges with respect to all predicted edges.
• UAS (Unlabeled Attachment Score) considers a predicted edge correct as long as its two terminal nodes are correct, regardless of the relation label of the edge.
• LAS (Labeled Attachment Score) considers a predicted edge correct only if both its terminal nodes and its relation label are correct.
Thus UAS is an upper bound of LAS. Note that WebDP structure includes two different sets of edges, i.e., the subordinating edge set and the coordinating edge set. We therefore calculate subordinating, coordinating and overall UAS/LAS for the respective edge sets. The overall UAS/LAS are micro-averages of the subordinating and coordinating UAS/LAS.
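The metric definitions above can be sketched in a few lines. The functions `attachment_counts` and `micro_scores` and the `(node_a, node_b, label)` edge format are hypothetical; the only content taken from the text is that scores are ratios of correct edges to predicted edges, that coordinating edges are undirected, and that the overall score is a micro-average over both edge sets.

```python
def attachment_counts(pred_edges, gold_edges, directed=True):
    """Count unlabeled-correct, labeled-correct and total predicted edges.

    Edges are (node_a, node_b, label) triples. For undirected (coordinating)
    edges the endpoint order is ignored.
    """
    def key(a, b):
        return (a, b) if directed else (min(a, b), max(a, b))

    gold = {key(a, b): label for a, b, label in gold_edges}
    unlabeled = labeled = 0
    for a, b, label in pred_edges:
        k = key(a, b)
        if k in gold:
            unlabeled += 1          # both terminal nodes are correct
            if gold[k] == label:
                labeled += 1        # the relation label is correct as well
    return unlabeled, labeled, len(pred_edges)


def micro_scores(*counts):
    """Micro-average UAS/LAS over one or more (unlabeled, labeled, total) triples."""
    unlabeled = sum(c[0] for c in counts)
    labeled = sum(c[1] for c in counts)
    total = sum(c[2] for c in counts)
    if total == 0:
        return 0.0, 0.0
    return unlabeled / total, labeled / total


# Toy example: subordinating edges are (child, parent, label), -1 marks the dummy node.
sub = attachment_counts(
    [(1, 0, "ELABORATION"), (2, 0, "EXPLANATION"), (3, 1, "CAPTION")],
    [(1, 0, "ELABORATION"), (2, 0, "ELABORATION"), (3, 2, "CAPTION")],
)
coord = attachment_counts([(2, 1, "LIST")], [(1, 2, "LIST")], directed=False)
overall_uas, overall_las = micro_scores(sub, coord)
```

Calling `micro_scores(sub)` or `micro_scores(coord)` alone gives the per-edge-set scores, while passing both gives the overall micro-average.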

Task Formulation
One straightforward way to address WebDP is to model it as a two-stage pipeline: we first predict all subordinating relations within a document to establish a backbone of the discourse structure, and then predict all coordinating relations based on the subordinating structure established in the first stage. In the first stage, subordinating relation prediction can be formulated as a conventional discourse dependency parsing task, for which various existing baseline models can be employed. In the second stage, coordinating relation prediction can be modeled as a classification task, for which we introduce an additional classifier layer. We simply feed the representations of node pairs, which consist of successive sibling nodes in the subordinating structure parsed in the first stage, into this classifier layer to predict their coordinating relation labels. During training, all components of the model can be learned jointly, and the classifier in the second stage is trained using the gold subordinating structures as input.
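The interface between the two stages is the set of successive sibling pairs. A minimal sketch of how these pairs might be derived from a first-stage output is given below; the parent-array encoding, the `DUMMY` constant and the decision to treat nodes attached to the dummy node as siblings of each other are assumptions made for illustration.

```python
from collections import defaultdict

DUMMY = -1  # hypothetical marker for "no subordinating node"


def successive_sibling_pairs(parents):
    """Return (left, right) pairs of successive siblings in reading order.

    parents[i] is the index of the subordinating (parent) EDU of EDU i,
    or DUMMY if EDU i has none.
    """
    children = defaultdict(list)
    for node, parent in enumerate(parents):
        children[parent].append(node)  # enumerate() preserves reading order

    pairs = []
    for siblings in children.values():
        pairs.extend(zip(siblings, siblings[1:]))
    return sorted(pairs)


# EDUs 0 and 3 are top-level; 1 and 2 depend on 0; 4, 5 and 6 depend on 3.
pairs = successive_sibling_pairs([DUMMY, 0, 0, DUMMY, 3, 3, 3])
```

Each resulting pair would then be passed to the coordinating relation classifier. Since every node has at most one left and one right neighbour among its siblings, this pairing yields at most 2 coordinating edges per node and no cycles, in line with the constraints of the schema.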

Baseline Models
For the dependency parsing model in the first stage, we choose baselines from state-of-the-art models for Dialogue Discourse Parsing (Afantenos et al., 2015) and Document Logical Hierarchy Extraction. To take advantage of the semi-structured HTML markup information contained in web documents, we use XPath encoders (Li et al., 2022b; Lin et al., 2020; Zhou et al., 2021) to enrich the node representation beyond the text encoder.
Specifically, the baseline dependency parsers we compare include: 1) NodeBased, a naive solution to the task based simply on an EDU representation module (node encoder) and a node-pair interaction module (classifier). 2) DeepSeq (Shi and Huang, 2019), which designs an incremental prediction method to leverage global history information. 3) Put-or-Skip (Cao et al., 2022), which also adopts incremental decoding while modeling the context information of each possible insertion site for the current node. 4) SSAGNN (Wang et al., 2021), which adopts a fully-connected graph neural network to enhance the modeling of deep interactions between node representations. 5) DAMT (Fan et al., 2022), which is based on SSAGNN and models the dependency parsing task in a multi-task learning manner. Implementation details can be found in Appendix C.

Main Result
We show the main experimental results in Table 2. We can see that: 1) WebDP is a feasible task whose patterns can be learned by neural methods. Current methods achieve a UAS of 60-65 and an LAS of 50-55, which are moderate values compared with other document-level discourse parsing tasks (Li et al., 2014; Afantenos et al., 2015) and indicate the feasibility of WebDP. Thus, the discourse structure we define on web documents can be effectively learned, and models trained on WebDP have the potential to be used in downstream tasks.
2) WebDP is a challenging task, and current methods leave great room for improvement. Compared with the human baseline in Table 2, state-of-the-art parsers still lag far behind. We believe this may be because WebDP models discourse structure unique to the free-style genre of web documents and considers discourse relations at a more macroscopic block level. In the future, designing task-specific models that better capture the unique features of web documents could be worthy of study.

Detailed Analysis
To further investigate what kinds of documents or EDU instances are more challenging to current models, we analyse three aspects: the influence of document length, the influence of dependency edge spanning distance and the influence of multiple topics on WebDP performance. From the results plotted in Figure 2, we can see that: 1) Long web documents pose a challenge to current models. As Figure 2 (a) shows, instance-level performance generally drops gradually for all models as document length increases. Compared with previous discourse tasks, web documents are usually longer and contain numerous nodes. Specifically, previous Discourse Dependency Parsing datasets have an average of 15-20 EDUs per document (Nishida and Matsumoto, 2022), and for Dialogue Dependency Parsing datasets the average is less than 10 (Li et al., 2020). In the WEBDOCS corpus, however, documents contain 47 EDUs on average, and some exceed 100 EDUs (Table 6). The larger size and more abundant information of a single web document make this task different from its previous analogues and challenging to previous models.
2) Models suffer from poor modeling of long-term dependencies. Previous work on Dialogue Dependency Parsing (Fan et al., 2022) demonstrated that EDU-level parsing performance is negatively correlated with the distance of the gold dependency edge. We find a similarly strong negative correlation in WebDP, as Figure 2 (b) shows. Long-term dependencies are hard to learn due to the distance bias introduced by the data: EDUs that are close in space are more frequently linked together. How to effectively remedy the long-term dependency problem needs further study.
3) Multiple-topics phenomena may come together with more straightforward hierarchy structures and may thus be easier for current models. To understand the influence of multiple topics on parsing performance, in Figure 2 (c) we plot document-level performance with respect to the proportion of BREAK labels in the coordinating edge set (as a proxy for topic change frequency). Figure 2 (c) shows a roughly positive correlation. This might be because documents that aggregate multiple topics tend to have more explicit and concise hierarchy structure patterns, and the semantic discrepancy between incoherent topics is obvious enough to be captured easily. On the other hand, correctly modeling multiple hierarchies within a single coherent topic seems more challenging to current models.
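The three quantities used in this analysis can be computed directly from an annotated document. The sketch below is illustrative only: the function `analysis_features`, the parent-array encoding with a `DUMMY` marker, and the definition of edge distance as the difference of reading-order indices are assumptions, not details taken from the paper.

```python
DUMMY = -1  # hypothetical marker for "no subordinating node"


def analysis_features(parents, coord_labels):
    """Compute document length, edge spanning distances and BREAK proportion.

    parents[i]   : index of the subordinating EDU of EDU i, or DUMMY
    coord_labels : list of coordinating relation labels in the document
    """
    num_edus = len(parents)
    # Distance of each subordinating edge, measured in reading-order positions.
    distances = [child - parent
                 for child, parent in enumerate(parents)
                 if parent != DUMMY]
    # Share of BREAK among coordinating edges, a proxy for topic change frequency.
    if coord_labels:
        break_ratio = coord_labels.count("BREAK") / len(coord_labels)
    else:
        break_ratio = 0.0
    return num_edus, distances, break_ratio


features = analysis_features(
    [DUMMY, 0, 0, DUMMY, 3, 1],
    ["LIST", "BREAK", "NARRATION", "BREAK"],
)
```

Bucketing documents by `num_edus` or `break_ratio`, and edges by distance, would then give the x-axes of plots like those described for Figure 2.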

Error Analysis
Models occasionally predict wrong structures; however, not all errors are equal. We define 5 structure error types to characterize different subordinating edge prediction errors in Table 3. Structure error types are defined by the position of the predicted edge with respect to the gold structure. From Table 3 we can conclude that: 1) Models tend to predict dummy subordinating nodes. Among wrongly predicted subordinating edges, a large proportion arises because models consider that there is no subordinating node for the target node (Dummy, 25.89%). On one hand, this may be caused by class imbalance in the data: due to multiple topics, many nodes in the dataset have no subordinating node (Table 6). On the other hand, compared with previous discourse parsing tasks where explicit relations are abundant, there may be more implicit relations (Dai and Huang, 2018) between blocks in documents, which are more difficult for current models to capture.
2) Models learn semantic correlation and superficial structures from data, while the precise information-packaging hierarchies underneath semantic correlation are hard to master. Table 3 indicates that, although models select wrong subordinating nodes, around half of the wrongly selected subordinating nodes are closely related to the target node either in semantics or in layout structure. For example, models tend to take coordinating nodes of the target node as its subordinating node (Sibling, 22.81%); to mistake coordinating nodes of the gold subordinating node, which usually have similar layout or superficial language features, for the subordinating node (Sibling of Parent, 11.13%); or to select indirect subordinating nodes across several hierarchy levels (Ancestor, 20.33%). All of these error types can be attributed to semantic correlation or superficial structure shortcuts learned by models, indicating that they still lack fine-grained semantic discrimination ability.
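To make the error taxonomy concrete, the sketch below assigns a wrongly predicted subordinating node to one of the categories named above by comparing it with the gold structure. It is an interpretation, not the authors' code: the function `structure_error_type`, the parent-array encoding with `DUMMY`, the treatment of top-level nodes as siblings of each other, and the catch-all name "Other" for the fifth type (which is not named in this section) are all assumptions.

```python
DUMMY = -1  # hypothetical marker for "no subordinating node"


def structure_error_type(gold_parents, target, predicted):
    """Classify the predicted subordinating node of `target` against the gold tree."""
    gold = gold_parents[target]
    if predicted == gold:
        return "Correct"
    if predicted == DUMMY:
        return "Dummy"              # model predicts no subordinating node
    if gold_parents[predicted] == gold:
        return "Sibling"            # a coordinating node of the target
    if gold != DUMMY:
        if gold_parents[predicted] == gold_parents[gold]:
            return "Sibling of Parent"  # a coordinating node of the gold parent
        ancestor = gold_parents[gold]
        while ancestor != DUMMY:    # walk up from the gold parent
            if ancestor == predicted:
                return "Ancestor"   # an indirect subordinating node
            ancestor = gold_parents[ancestor]
    return "Other"


# Gold tree: 0 is top-level; 1 and 3 depend on 0; 2 depends on 1; 4 and 5 depend on 3.
gold_tree = [DUMMY, 0, 1, 0, 3, 3]
error_types = [structure_error_type(gold_tree, 5, p) for p in (DUMMY, 4, 1, 0, 2, 3)]
```

In this toy tree, the gold subordinating node of EDU 5 is EDU 3, and the six candidate predictions cover each category once.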

Conclusion
In this paper, we inspect web documents from a discourse linguistic perspective to reveal their underlying discourse structure. Inspired by the linguistic theories of Rhetorical Relations and SDRT, we build a web document discourse schema that simultaneously models subordinating and coordinating relations. Based on the schema, we propose a new task, WebDP, and contribute a dataset to promote discourse analysis research on web documents. Experimental results with recent neural models demonstrate the challenge of WebDP, and detailed analyses provide insights for future studies. We believe the web document discourse schema is promising for facilitating document-level natural language processing research by explicitly modeling the discourse structure of web documents.

Limitations
This study has several possible limitations: 1. Discourse Schema. It should be acknowledged that web documents are themselves heterogeneous, and a unified framework that accommodates all web documents may be infeasible. In this article, instead of pre-defining the domain or type of web documents we target, we adopt a problem-motivated research paradigm in which we ground the discourse schema design in two characteristics (multiple topics and multiple hierarchies). Although we simplify the discourse schema to promote its universality, due to the free style and domain diversity of web document data, it still has a limited scope of usage, mainly general news reports with multiple topics. In future studies, the label sets could be revised to better account for the semantic functions in web documents of specific domains, where fine-grained and domain-specific labels can be considered. 2. Task Setting. In this paper, we only consider parsing the discourse from a list of document logical blocks that has been pre-processed in advance; the task setting does not contain a complete pipeline from the input HTML source code to the final output discourse structure. In the future, the gap between HTML elements and document logical blocks should be closed automatically in order to apply the approach to downstream application scenarios. 3. Data Bottleneck. The volume of annotated data in this paper is limited due to the high labour cost, which may introduce noise into the experiments and distort the performance and analysis. Also, the domain diversity and multilingualism of the dataset could be questioned, since we collect data from a single platform in Chinese. In the future, this data bottleneck can be remedied by more dedicated manual annotation efforts, by weak supervision techniques, and by developing data-efficient models.

A List of Discourse Relation Labels
A.1 Subordinating Relations
See Table 4 for the proposed subordinating semantic relations. Seven relation types are designed based on a preliminary case study of web documents. Some relation types are borrowed from previous theoretical research, while others (LITERATURE, CAPTION and PLACEHOLDER) are added to fully account for pragmatic phenomena unique to the web document domain. Since the directed subordinating edges are analogous to parent-child relations in a tree, in Table 4 we use the term "parent node" for nodes at the higher information packaging level (dominating/subordinating nodes) and "child node" for nodes at the lower level (subordinated nodes).

Elaboration
The child node provides a detailed elaboration of the semantic content expressed in the parent node. This covers cases where the child node is completely or partially summarized by its parent node, where multiple incoherent child nodes are semantically aggregated by one parent node, and where the child node restates the same or similar text as the parent node.

Explanation
The child node explains the parent node: it provides richer and more detailed information supporting the claim of the parent node, or answers questions posed by the parent node.

Topic&Tag
The relationship between the nodes is abstract and conceptual. The parent node is usually an entity, concept, or category. It gives a classification tag for the child node or topically gives rise to the child node. Since the semantic information in the parent node is highly abstract, it cannot be considered a valid summary of the child node.

Attribute
The child node is the attribute value of the parent node, providing the content referred to by the parent node. This covers cases where the child node is cited by the parent node, where the parent node is the title of a specific genre of text (e.g., "notice", "declaration") and the child node provides the corresponding content, and where the parent node and child node form a key-value pair in a table.

Literature
This category includes titles commonly found in literary special reports and literary works. The semantics of these parent nodes are primarily literary in nature. They can be considered placeholders with literary significance that draw attention to the content expressed by their child nodes. Due to their unique nature, they cannot be assigned to any of the above categories.

Caption
The child node provides caption text, that is, descriptive information or meta information for the parent node. This is common for caption texts below images.

Placeholder
The parent node does not carry actual semantics; however, its existence allows the child nodes to be integrated into a semantic whole.In this relation, there should be a coherent semantic relationship among the child nodes, and there should be clear semantic boundaries between the group of child nodes and other nodes at the same level.
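The parent/child terminology used for the subordinating relations above can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the function name `build_children`, the integer block ids, and the toy edges are hypothetical, and the check that each child has a single parent is an assumption that follows from treating subordinating edges as tree edges.

```python
from collections import defaultdict

# The seven subordinating relation labels named in this appendix.
SUBORDINATING_LABELS = {
    "Elaboration", "Explanation", "Topic&Tag", "Attribute",
    "Literature", "Caption", "Placeholder",
}


def build_children(edges):
    """Group directed subordinating edges (parent, child, label) by parent.

    Raises ValueError for an unknown label or for a child with two parents
    (assumption: subordinating edges form a tree, so each child has one parent).
    """
    children = defaultdict(list)
    seen_children = set()
    for parent, child, label in edges:
        if label not in SUBORDINATING_LABELS:
            raise ValueError(f"unknown label: {label}")
        if child in seen_children:
            raise ValueError(f"node {child} has more than one parent")
        seen_children.add(child)
        children[parent].append((child, label))
    return dict(children)


# Hypothetical toy document: block 0 dominates blocks 1 and 2,
# and block 3 is a caption attached to block 2.
toy_edges = [
    (0, 1, "Elaboration"),
    (0, 2, "Explanation"),
    (2, 3, "Caption"),
]
tree = build_children(toy_edges)
```

Here `tree[0]` lists the children of block 0 together with their relation labels, which mirrors the "parent node" and "child node" wording used in the relation descriptions.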
HTML source code is crawled from websites, and we use a naive rule-based parsing script followed by simple manual post-processing to extract the content of each HTML element from the HTML files. Thanks to the structured nature of HTML, this lets us easily acquire the content of logical blocks in the correct reading order with only lightweight manual post-processing.
We employ two human annotators with bachelor's degrees to provide gold discourse structure annotations for the 300 web documents. The annotation stage consists of an annotator training phase and a formal annotation phase. During the training phase, annotators are trained with an annotation guideline and a few representative examples selected during the preliminary case study. We discuss these examples with the annotators to clear up confusion and then revise the annotation guideline into its final version. Then, in the formal annotation phase, the two annotators annotate 180 web documents independently. We pay the annotators a fair salary (35 dollars per hour), and the training phase is paid as well.

B.2 Data Statistics
See Table 6 for statistics of the annotated WEBDOCS corpus. Compared with previous discourse parsing tasks, web documents have far more EDUs, and each EDU usually contains much longer content, making WebDP a challenging task.

C Experiment Details
All 5 compared baselines are re-implemented using the PyTorch deep learning framework, with Hugging Face used to load pre-trained language model checkpoints. During re-implementation, we adapt all models to the fine-tuning paradigm with a pre-trained BERT text encoder (Devlin et al., 2019). Experiments are performed on a single TITAN RTX GPU with 24GB of memory and an Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz processor.
We train all baseline models for 100 epochs with a batch size of 1 and a linear learning rate scheduler with warm-up. The optimizer is AdamW. We set the maximum number of EDUs per document to 200 for both training and evaluation. During training, performance on the dev set is evaluated after each epoch, and we report the test set performance of the epoch checkpoint with the best dev set performance. A brief hyper-parameter search on the dev set showed that a learning rate of 1e-4 and 8 gradient accumulation steps are preferable.
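To make the interaction between batch size, gradient accumulation and the learning rate schedule concrete, the following is a minimal sketch in plain Python. The peak learning rate, batch size, accumulation steps and epoch count come from the text above. The number of training documents, the warm-up ratio and the exact schedule shape (linear warm-up followed by linear decay to zero) are assumptions made for illustration, since they are not stated here.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier in [0, 1] applied to the peak learning rate at `step`."""
    if step < warmup_steps:
        # Linear warm-up from 0 to 1.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values reported in the text.
PEAK_LR = 1e-4
BATCH_SIZE = 1
ACCUM_STEPS = 8
EPOCHS = 100

# Hypothetical values, chosen only for illustration.
NUM_TRAIN_DOCS = 240
WARMUP_RATIO = 0.1

# One optimizer update happens after ACCUM_STEPS mini-batches.
effective_batch_size = BATCH_SIZE * ACCUM_STEPS
updates_per_epoch = NUM_TRAIN_DOCS // effective_batch_size
total_updates = updates_per_epoch * EPOCHS
warmup_updates = int(WARMUP_RATIO * total_updates)


def lr_at(update):
    """Learning rate used for the given optimizer update index."""
    return PEAK_LR * linear_warmup_decay(update, warmup_updates, total_updates)
```

With a batch size of 1, accumulating gradients over 8 steps means each parameter update is based on 8 documents, so the schedule is indexed by optimizer updates rather than by individual documents.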

E Ablation Study
We further conduct a series of ablation studies to investigate the effect of some common practices in previous parsing tasks, as well as whether introducing HTML markup features that are unique to web documents can benefit the task. The ablation studies are based on the NodeBased model, which simply consists of a node encoder and a classifier module (see Table 7; Line 1, "Basic Settings", is the setting reported in the main results in Table 2, which has the best Overall LAS among all settings).
coder (Lin et al., 2020; Zhou et al., 2021) and FFN XPath encoder (Li et al., 2022b) to enhance our node representation module. Different modality aggregation methods (Wang et al., 2020; Yu et al., 2022) are also investigated.
To improve the modeling of text information, based on the observation that the text in a node often includes one or several sentences, we equip models with more advanced sentence representation modules such as SentenceBERT (Reimers and Gurevych, 2019) and SimCSE (Gao et al., 2021).
Besides adding XPath information and advanced sentence embeddings to enhance the node encoder, we also investigate the effect of a Global Context Encoder and the Biaffine Attention mechanism. A Global Context Encoder (e.g., hierarchical GRU, Wang et al., 2021) is a common practice in discourse-level dependency parsing: it adds a global interaction layer on top of the representation of each single node to model higher-level context information. Biaffine Attention (Dozat and Manning, 2016) replaces the MLP in the classifier module and has the merit of effectively and explicitly modeling the interaction between node pairs. Both are means of better modeling the interaction between nodes, either during encoding or during prediction.
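The difference between a biaffine scorer and an MLP over concatenated node vectors can be sketched in a few lines. The following is an illustrative, dependency-free version of one common biaffine form, score = h_parent^T U h_child + w^T [h_parent; h_child] + b. The function name, the toy vectors and the parameter values are hypothetical, and the exact parametrization used in the experiments is not specified here.

```python
def biaffine_score(h_parent, h_child, U, w, b):
    """Score one candidate edge from a parent node to a child node.

    h_parent, h_child: node representations as lists of floats.
    U: bilinear weight matrix (len(h_parent) x len(h_child)).
    w: linear weights over the concatenation [h_parent; h_child].
    b: scalar bias.
    """
    # Bilinear term: explicit multiplicative interaction between the two nodes.
    bilinear = sum(
        h_parent[i] * U[i][j] * h_child[j]
        for i in range(len(h_parent))
        for j in range(len(h_child))
    )
    # Linear term: each node's individual contribution to the score.
    linear = sum(wk * xk for wk, xk in zip(w, h_parent + h_child))
    return bilinear + linear + b


# Hypothetical toy parameters and node vectors.
U = [[0.0, 1.0],
     [0.0, 0.0]]
w = [0.5, 0.0, 0.0, 0.5]
b = 0.1
node_a = [1.0, 0.0]
node_b = [0.0, 1.0]

score_a_to_b = biaffine_score(node_a, node_b, U, w, b)  # a as parent of b
score_b_to_a = biaffine_score(node_b, node_a, U, w, b)  # b as parent of a
```

Because of the bilinear term, swapping the roles of the two nodes changes the score, which is one way the scorer models the interaction between node pairs explicitly rather than leaving it to a generic MLP.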
Results in Table 7 support our hypotheses: 1) the Global Context Encoder is significantly helpful for the current model (Line 1 vs Line 3 in Table 7), since it can effectively model the contextual interaction of EDUs in web documents, which are usually too long to be concatenated and fit into the window size of current PLM encoders as in previous works (Line 1 vs Line 5) (He et al., 2021; Fan et al., 2022); 2) using Biaffine Attention as the classifier is more beneficial than an MLP for WebDP (Line 1 vs Line 6), in accordance with previous observations; 3) the different XPath encoding and modality aggregation methods all outperform the baseline without XPath (Lines 1, 8, 9, 10 vs Line 7), indicating that adding such markup information unique to web documents helps the new task. However, 4) unexpectedly, pre-trained sentence encoding methods based on modifications of BERT do not outperform the BERT-base encoder and even hurt performance by some margin (Line 1 vs Lines 12, 13). This exception may be due to the domain discrepancy between pre-training and downstream data, and the reason behind it needs further investigation.

Figure 1: Characteristics and challenges of discourse structure representation for web documents. Shown on the top are two representative web document snippets. Web documents, being semi-structured data, naturally incorporate multiple hierarchies of information packaging and exhibit a free-styled nature with multiple topics. These characteristics necessitate the development of a novel discourse schema proposed in this paper (WebDP Structure), which effectively captures discourse phenomena in web documents. By contrast, previous studies on superficial visual document structures (e.g., Layout Structure) fail to faithfully represent either the discourse relation types (as seen in both documents within the figure) or the discourse structure itself (as observed in the document on the right side). Examples of WebDP Structures of real web documents are displayed in Appendix F.

Furthermore, edges in E(d) can be divided into 2 disjoint subsets according to the type of rhetorical relation they hold, i.e. subordinating edge set E

Figure 2: Detailed analysis of the challenges of WebDP. (a) Document-level Subordinating UAS with respect to the number of EDUs in the document. (b) EDU-level Subordinating UAS with respect to the spanning distance of the gold edge. (c) Document-level Subordinating UAS with respect to the frequency of independent topics (proportion of the Break label). The metric plotted here is the Subordinating UAS; see Appendix D for the same analysis on other metrics.
Figures 3, 4 and 5 show the results on the other evaluation metrics for the analysis in § 5.4. Conclusions similar to those in § 5.4 can be drawn, except for the chaotic pattern of Coordinating UAS and Coordinating LAS with respect to the proportion of the Break label (Figure 4 (c) and Figure 5 (c)). This may be because we use pipeline modeling methods that do not directly learn to predict coordinating edges, so the trends in Coordinating UAS/LAS are noisier.

Figure 5: Detailed analysis of the challenges of WebDP, similar to Figure 2. The metric plotted here is the Coordinating LAS. (a) Document-level Coordinating LAS with respect to the number of EDUs in the document. (b) EDU-level Coordinating LAS with respect to the spanning distance of the gold edge. (c) Document-level Coordinating LAS with respect to the frequency of independent topics (proportion of the Break label).

Figure 6 and Figure 8 display the WebDP discourse structures of selected web documents in the WEBDOCS dataset, and their English translations are presented in Figure 7 and Figure 9.

Figure 6: Selected discourse structure example from the WEBDOCS dataset. Some content is omitted with ellipses for clearer display. An English translation of the same web document discourse structure and web document snippet can be found in Figure 7.

Figure 7: Selected discourse structure example from the WEBDOCS dataset. English translation of Figure 6. Some content is omitted with ellipses for clearer display.

Figure 8: Selected discourse structure example from the WEBDOCS dataset. Some content is omitted with ellipses for clearer display. An English translation of the same web document discourse structure and web document snippet can be found in Figure 9.

Table 1: Comparison of previous task settings that aim at modeling structures within documents and our proposed web document discourse parsing.

Table 2: Main Results. All compared models are equipped with an additional coordinating relation classifier described in § 5.1, and all reported results for compared models are averaged over 3 random seeds. The human baseline is calculated from doubly-annotated document instances. Best performances among compared models are underlined.

Table 3: Statistics of Structural Error Types in subordinating relations. T: Target EDU, GS: Gold Subordinating EDU, PS: Predicted Subordinating EDU. The cases shown are outputs of NodeBased; the other compared models have similar error type profiles. Examples have been translated into English for display.

Table 4: List of subordinating rhetorical relations between different-level nodes.

Table 6: Data Statistics of the WEBDOCS Dataset.