QueryForm: A Simple Zero-shot Form Entity Query Framework

Zero-shot transfer learning for document understanding is a crucial yet under-investigated scenario that helps reduce the high cost of annotating document entities. We present a novel query-based framework, QueryForm, that extracts entity values from form-like documents in a zero-shot fashion. QueryForm contains a dual prompting mechanism that composes both the document schema and a specific entity type into a query, which is used to prompt a Transformer model to perform a single entity extraction task. Furthermore, we propose to leverage large-scale query-entity pairs generated from form-like webpages with weak HTML annotations to pre-train QueryForm. By unifying pre-training and fine-tuning into the same query-based framework, QueryForm enables models to learn from structured documents containing various entities and layouts, leading to better generalization to target document types without the need for target-specific training data. QueryForm sets new state-of-the-art average F1 scores on both the XFUND (+4.6%~10.1%) and the Payment (+3.2%~9.5%) zero-shot benchmarks, with a smaller model size and no additional image input.


Introduction
Form-like document understanding has become a booming research topic recently thanks to its many real-world applications in industry. Form-like documents refer to documents with rich typesetting formats, such as invoices and receipts in everyday inventory workflows. Automatically extracting and organizing structured information from form-like documents is a valuable yet challenging problem.
Recent methods (Xu et al., 2020b; Garncarek et al., 2021; Lee et al., 2022) often discuss the problem of form-like document understanding, e.g., document entity extraction (DEE), in the supervised setting, assuming the training and test sets are of the same document type. However, in real-world scenarios, there is often a need to generalize models from seen document types to new, unseen document types. Beyond annotation costs, endlessly training specialized models on new types of documents is not scalable in many practical scenarios. Moreover, methods in the supervised setting pre-define the document schema, i.e., the set of entities contained in the document, following the sequence tagging framework via the BOISE labeling format (Ratinov and Roth, 2009). Consequently, these models lack the ability to learn from different documents with diverse schemas.
Thus, it is desirable to have a systematic way to transfer knowledge from existing annotated documents of different types to an un-annotated target document type (e.g., the invoice in Figure 2, right). This learning paradigm is usually referred to as zero-shot transfer learning in the literature (Xu et al., 2021). Beyond this, it is even more desirable to leverage highly-structured form-like documents with rich schemas, such as the form-like webpages in Figure 2, left. Although webpages do not have explicit human annotations, we believe the diverse schemas and natural "entities", such as headers and text paragraphs, that exist in webpages can be valuable for document understanding. However, how to effectively utilize these webpages, which have a high discrepancy from documents like invoices and receipts, is an open yet challenging problem.
In this work, we propose a novel query-based framework, QueryForm, to learn transferable knowledge from different types of documents for zero-shot entity extraction on the target document type. The workflow of QueryForm is illustrated in Figure 1. Ideally, we would like to prompt the model: "This document has the following [SCHEMA]; please extract its [ENTITY] value," and the model should accurately predict the word tokens belonging to the queried entity. To this end, we encode both schema and entity information in our query, so that the model is no longer limited to a certain document type and a fixed set of entity types (or classes). Moreover, our query-based design can benefit further from large-scale datasets with diverse schemas and entity types.
In order to feed this kind of composite query, we propose a dual prompting strategy to effectively prompt the backbone model, e.g., a pre-trained Transformer, to make conditional predictions. As its name suggests, the dual prompting strategy consists of an E(ntity)-Prompt and an S(chema)-Prompt. Depending on the annotations we have, we can either generate the prompts from semantic labels or learn them directly from data. Although similar concepts to dual prompting exist in the vision field (Wang et al., 2022a,b) to solve different problems, the main design in QueryForm is original in DEE. We also propose a query-based pre-training method, QueryWeb, which leverages a highly accessible and inexhaustible resource: publicly available webpages. During the pre-training stage, the model learns to quickly adapt to various queries composed of different S-Prompts and E-Prompts generated from the HTML source of webpages. By decoupling entity and schema from document types, the model can learn more transferable knowledge, leveraging the rich layout, scale, and content information in webpages to make query-conditional predictions. In summary, our work makes the following contributions:

• We propose QueryForm, a novel yet simple query-based framework for zero-shot document entity extraction. QueryForm provides a new dual prompting mechanism that encodes both document schema and entity information to learn transferable knowledge from source to target document types.
• We demonstrate an effective pre-training approach, QueryWeb, that collects publicly available webpages with various layouts and HTML sources, and pre-trains QueryForm via the dual prompting mechanism. Although webpages exhibit a high discrepancy from the target documents, we show this approach consistently improves zero-shot performance.
• With extensive empirical evaluation, QueryForm sets new state-of-the-art F1 scores on both the Inventory-Payment and FUNSD-XFUND zero-shot transfer learning benchmarks.
Related Work

More recently, neural models have become the mainstream solution for document entity extraction (DEE). Both RNN-based (Palm et al., 2017; Aggarwal et al., 2021) and CNN-based models (Katti et al., 2018; Zhao et al., 2019; Denk and Reisswig, 2019) have been adopted for the DEE task. Motivated by the superior performance of Transformers (Vaswani et al., 2017) on various NLU tasks (Devlin et al., 2018; Raffel et al., 2020), pre-trained Transformer encoders (Devlin et al., 2018; Conneau et al., 2019; Liu et al., 2019) are readily available for serialized document tokens. Multimodal pre-training (Xu et al., 2020a,b, 2021; Appalaraju et al., 2021) achieves better performance than the text modality alone by incorporating visual information, at the cost of more expensive data collection and computation. Our work presents a novel pre-training method using the text modality alone, which is complementary to models that rely on the image modality (Kim et al., 2022) or multiple modalities (Xu et al., 2020b, 2021). Moreover, we leverage publicly available webpages, which contain rich structured information and are much more accessible than documents. Different from the common Masked Language Model (MLM) objective used in pre-training, QueryForm has the same query-conditional objective during both pre-training and fine-tuning, which intuitively strengthens the transferability of pre-trained knowledge.
To the best of our knowledge, DQN (Gao et al., 2022) and Donut (Kim et al., 2022) are the closest works to ours in the DEE domain. However, our work still differs from them in multiple respects, including problem setting, query design, and pre-training technique. On the other hand, leveraging webpages to pre-train language models has been explored in prior work: Liu et al. (2019) and Brown et al. (2020) extract text corpora from webpages, and Aghajanyan et al. (2021) use HTML source for pre-training. However, to the best of our knowledge, we are the first to leverage both webpages and the corresponding HTML source in a novel query-based pre-training framework to address the challenging zero-shot DEE task. Our framework takes full advantage of the rich schema and layout information from webpages and utilizes HTML tags as weak entity annotations to align pre-training with the downstream DEE task.

Problem Formulation
Given serialized words from a form-like document, we formulate the DEE problem as sequence tagging over tokenized words, i.e., for each word, we predict its corresponding entity class. Recent methods (Xu et al., 2020b; Garncarek et al., 2021; Lee et al., 2022) use the BOISE labeling format (Ratinov and Roth, 2009), classifying each token as {e-Begin, Outside, e-Inside, e-Single, e-End} of a certain entity e ∈ E to mark the entity span, where E is the set of entities of interest. Thus the cardinality of the label space is (4 × |E| + 1). In our formulation, we explicitly encode the entity in the E-Prompt. Therefore, we are able to use a more succinct and generalizable BOISE labeling with only 5 labels, {Begin, Outside, Inside, Single, End}, to mark the span. Our approach decouples the label space from entity types. Following Lee et al. (2022), we then apply the Viterbi algorithm to get the final prediction.
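The entity-decoupled labeling scheme above can be made concrete with a small sketch (not the paper's code): once the queried entity is fixed by the E-Prompt, every token is tagged with one of only 5 labels, regardless of how many entity types the schema contains.

```python
# Sketch of the entity-decoupled 5-label BOISE tagging described above.

BOISE = ["B", "O", "I", "S", "E"]  # Begin, Outside, Inside, Single, End

def boise_tags(num_tokens, span):
    """Tag `num_tokens` tokens where `span = (start, end)` (inclusive)
    marks the queried entity's answer; all other tokens are Outside."""
    tags = ["O"] * num_tokens
    start, end = span
    if start == end:
        tags[start] = "S"            # single-token entity
    else:
        tags[start] = "B"
        tags[end] = "E"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

# A 6-token document where tokens 2..4 answer the current query:
print(boise_tags(6, (2, 4)))  # ['O', 'O', 'B', 'I', 'E', 'O']

# Entity-specific BOISE needs 4*|E| + 1 labels; the decoupled scheme
# always needs 5, independent of the number of entity types.
num_entities = 7  # e.g., the 7 entity types of the Payment schema
assert 4 * num_entities + 1 == 29
```

The key point is that the label space no longer grows with |E|, which is what lets one model answer queries about arbitrary, unseen entity sets.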
In our work, we focus on the zero-shot DEE setting proposed by Xu et al. (2021), where 1) the training source documents have a significant domain gap from the target test documents (e.g., languages or document types), 2) no training documents are available from the target document type, and 3) the source documents include the entities contained in the target documents.

Architecture design
Following the setting in earlier work (Majumder et al., 2020; Lee et al., 2022), our method takes the WordPiece (Wu et al., 2016) tokenized outputs from an Optical Character Recognition (OCR) engine in reading order (left-to-right and top-to-bottom). By design, our method is compatible with any sequence encoder model as the backbone. Following Lee et al. (2022), we adopt the long-sequence Transformer extension ETC (Ainslie et al., 2020) as our backbone, which contains Rich Attention, an enhancement of the self-attention layers that encodes 2D spatial layout information. We find this method (used as our baseline) performs fairly strongly in the usual supervised learning setup; however, its performance drops significantly in the zero-shot setting. Note that in practice, one can use QueryForm with OCR engines with different heuristics or with other model backbones (Zaheer et al., 2020). Our work focuses on how to enrich entity querying capability for forms via the proposed QueryForm.
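The reading-order serialization mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's exact OCR heuristic: words are grouped into lines by a y-coordinate tolerance (our assumption), then read left-to-right within each line, lines top-to-bottom.

```python
# Hypothetical sketch of reading-order serialization of OCR words:
# left-to-right within a line, lines top-to-bottom. The y-tolerance
# line-grouping heuristic is an assumption for illustration.

def serialize(words, line_tol=5):
    """words: list of (text, x, y) with top-left coordinates."""
    # Sort by y first so words on (roughly) the same line are adjacent.
    words = sorted(words, key=lambda w: (w[2], w[1]))
    lines, current = [], []
    for w in words:
        if current and abs(w[2] - current[-1][2]) > line_tol:
            lines.append(current)   # y jumped: start a new line
            current = []
        current.append(w)
    if current:
        lines.append(current)
    # Left-to-right within each line; lines are already top-to-bottom.
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1])]

tokens = serialize([("Total:", 10, 100), ("$755", 80, 101), ("Invoice", 10, 20)])
print(tokens)  # ['Invoice', 'Total:', '$755']
```

Real OCR engines use more elaborate layout heuristics, which is exactly why the text notes that QueryForm is agnostic to the particular engine.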

Methodology
We propose QueryForm as a general query-based framework for solving the zero-shot DEE problem. QueryForm consists of a novel dual prompting strategy and a specially-designed pre-training approach called QueryWeb. As shown in Figure 3, the model is first pre-trained on a large-scale webpage dataset to learn to make conditional predictions under relatively noisy queries generated from combinations of webpage domains (a proxy of schema) and HTML tags (a proxy of entity). Then, the model is fine-tuned on form-like documents with a unified schema to learn more specialized knowledge, learning schema information in the S-Prompt and encoding more accurate entity-level knowledge in the E-Prompt. Finally, we test the model on the target document type in a zero-shot fashion (Figure 1).

Dual Prompting
Given a serialized document represented as a sequence of tokens x from the set of all documents X, and a set of entities of interest E = {e_1, ..., e_m}, the goal is to let the model predict the corresponding label sequence y. In our query-based framework, we additionally define Q = {q_1, ..., q_m} as the set of queries, where there is a bijection between Q and E. The model takes an input tuple (q_i, x) and predicts the conditional output ŷ_{q_i} (see the BOISE prediction in Figure 3 for an example). ŷ_{q_i} defines the token spans of the given query q_i with 5 classes (i.e., BOISE).
To encode entity information into the query, we can use the entity name as the query, i.e., q_i = e_i. We denote by t the tokenizer, f_θ the input embedding layer, and p_ϕ the rest of the language model. Then we can get the token-wise BOISE prediction:

ŷ_{q_i} = p_ϕ([f_θ(t(e_i)) ; f_θ(t(x))]),   (1)

where "[• ; •]" is the concatenation operation along the token-length dimension. Note that although e_i itself is not learnable, we can still learn its embedding f_θ(t(e_i)) by optimizing θ.
We name the query directly generated from the entity name the E-Prompt. However, in QueryForm, our novel pre-training stage requires learning from a large amount of webpages, which contain diverse categories of schema. Therefore, the model naturally requires more informative queries that also encode the schema information. To this end, we propose the S-Prompt to capture schema information. In pre-training, we can generate the S-Prompt in a similar way as we do for the E-Prompt; please see Section 4.2 for more details. During fine-tuning, the schema of form-like documents is often very different from that of webpages. Thus, we let the model learn the schema representation directly from the data, so that it can align well with the S-Prompts used in pre-training. We denote the S-Prompt by s, a sequence of learnable vectors in the token embedding space that captures schema information implicitly from the data. According to the assumption in Section 3.1, the documents we use in fine-tuning include the target entities of interest. Intuitively, the schema information from fine-tuning documents should be transferable to the target test document type, so we directly reuse the learned S-Prompt when testing on the target documents. In this case, q_i = (s, e_i) and the prediction becomes:

ŷ_{q_i} = p_ϕ([s ; f_θ(t(e_i)) ; f_θ(t(x))]),   (2)

Finally, the model is trained with the objective:

min_{θ,ϕ,s} Σ_i L(ŷ_{q_i}, y_{q_i}),   (3)

where L is the cross-entropy loss.
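The query-conditional input construction above can be sketched in a few lines of toy code. This is purely illustrative (pure-Python lists standing in for embedding matrices; dimensions and the stand-in model are our assumptions, not the paper's implementation):

```python
# Toy sketch of the query-conditional input: the S-Prompt, the embedded
# E-Prompt, and the embedded document tokens are concatenated along the
# token-length dimension before entering the sequence model p_phi.
# Sizes are illustrative assumptions.

d = 4                                     # toy embedding size
s_prompt = [[0.0] * d for _ in range(3)]  # learnable S-Prompt, 3 "tokens"
e_prompt = [[0.1] * d for _ in range(2)]  # embedded E-Prompt, f_theta(t(e_i))
doc      = [[0.2] * d for _ in range(8)]  # embedded document, f_theta(t(x))

# "[s ; f(t(e_i)) ; f(t(x))]": concatenation along the token length.
model_input = s_prompt + e_prompt + doc
assert len(model_input) == 3 + 2 + 8

def p_phi(seq):
    """Stand-in for the Transformer: maps each position to 5 BOISE logits."""
    return [[sum(tok)] * 5 for tok in seq]

# The prediction y_hat_{q_i} is read off the document positions only.
doc_logits = p_phi(model_input)[len(s_prompt) + len(e_prompt):]
assert len(doc_logits) == len(doc)
```

In a real implementation p_phi would be the ETC backbone and the cross-entropy loss of Equation 3 would be computed against the BOISE labels over these document positions.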
Recall that, in the fine-tuning stage, QueryForm is trained with a moderately sized set of queries composed of E-Prompts generated from human-annotated entities and a learnable S-Prompt that encodes schema information. It is reasonable to believe that if we can pre-train the model with an extremely large set of queries composed of different E-Prompts and S-Prompts generated from weakly-annotated documents, the model will perform better than with Masked Language Model (MLM) (Devlin et al., 2018) pre-training alone. Here, we present our simple webpage-based pre-training technique as well as our data collection recipes that empower the pre-training.
Dual prompting based pre-training. We directly extract schema and entity information from the rich HTML structure of various webpages and use them to generate the S-Prompt and the E-Prompt, respectively. With a slight abuse of notation, we denote the S-Prompt by s̄, where the bar indicates that the S-Prompt is no longer a learnable parameter. Different from the fine-tuning stage with a single set of entities under a unified schema, we can group the webpages by schema:

{(s̄_j, E_j, X_j)}_j,

where each schema s̄_j corresponds to a set of entities E_j and a set of webpages X_j. Similarly, the model takes the query-document tuple (q_ji, x), where q_ji = (s̄_j, e_ji), e_ji ∈ E_j and x ∈ X_j, and outputs the following conditional prediction ŷ_{q_ji}:

ŷ_{q_ji} = p_ϕ([s̄_j ; f_θ(t(e_ji)) ; f_θ(t(x))]),   (4)

Equation 4 is analogous to Equation 2; however, s̄_j here is directly sourced from webpage data, which makes it different from the learnable s in Equation 2. Then we have the following pre-training objective:

min_{θ,ϕ} Σ_j Σ_i L(ŷ_{q_ji}, y_{q_ji}),   (5)

The pre-training format is highly aligned with the fine-tuning objective in Equation 3, so the model learns consistently during both stages to make query-conditional predictions.

Data collection recipe. How to extract schema and entity information from arbitrary webpages is another contribution of this paper. Consider the HTML snippet from Figure 4. First, it naturally contains two entities, and the combination of HTML tags defines what each entity is about, i.e., its "entity type". Therefore, "Bath Mat" is of entity type product/name, and "$13.99" is of entity type product/price. Second, the schema of the webpage is {product/name, product/price}, and this schema is usually shared by a series of similar webpages under the same domain. Therefore, we can extract the domain name "www.example.com" as the schema information. The schema information and entity types generated from webpages are then respectively encoded by our dual prompting mechanism.
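A minimal sketch of this recipe, using only the standard library, is shown below. It is our illustration under stated assumptions, not the paper's pipeline: here the page's domain serves as the schema (S-Prompt source) and a class-based tag path serves as the weak entity type (E-Prompt source); real webpages would need far more filtering, and the class names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical sketch of weak annotation from HTML: the domain acts as
# the schema, and a short tag/class path acts as the weak entity type.

class WeakAnnotator(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.pairs = [], []

    def handle_starttag(self, tag, attrs):
        # Prefer the class attribute as a (weak) semantic label.
        self.stack.append(dict(attrs).get("class", tag))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # (weak "entity type" from the tag path, entity value)
            self.pairs.append(("/".join(self.stack[-2:]), text))

html = ('<div class="product"><span class="name">Bath Mat</span>'
        '<span class="price">$13.99</span></div>')
annotator = WeakAnnotator()
annotator.feed(html)

schema = urlparse("https://www.example.com/item/42").netloc  # S-Prompt source
print(schema)           # www.example.com
print(annotator.pairs)  # [('product/name', 'Bath Mat'), ('product/price', '$13.99')]
```

Each extracted (schema, entity type, value) triple then yields one query-value training pair for the pre-training objective above.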
In practice, the schema and entity information automatically generated from webpages are often noisy. However, in the experiments, our model is still able to learn structured information from noisy queries and obtain significantly better entity extraction performance on the target form-like documents. Moreover, in order to represent webpages in a manner that generalizes to form-like documents, the webpage representation consists only of the visible text tokens and their corresponding x/y coordinates.

Experiments

Datasets and Experiment Design
We use 3 publicly available datasets and 2 in-house datasets that we collected to design and conduct extensive experiments to validate our method.

XFUND (Xu et al., 2021) is a multilingual form understanding benchmark that extends the FUNSD dataset. The XFUND benchmark covers 7 different languages with 1,393 fully annotated forms, where each language includes 199 forms with the same set of 4 entity types as FUNSD.
Payment (Majumder et al., 2020) consists of around 10K documents annotated with 7 entity types by human annotators. The corpus is collected from different vendors with various layout templates. In the few-shot learning experiments, we create multiple subsets by randomly subsampling documents from its training set.
Inventory is a dataset we collected that contains inventory-related purchase documents in English (e.g., utility bills), covering a few document types different from those in the Payment dataset. The dataset consists of ∼24k documents in two annotated versions. The first version, Inventory-7, is annotated at word level with the same 7 entity types as Payment. The second version, Inventory-28, is annotated at word level with 21 additional entity types, including common entity types such as shipping address and supplier name.
QueryWeb is a dataset we collected from publicly available English webpages on the Internet, following the acquisition procedure stated in Section 4.2.
QueryWeb-ML is a multilingual (ML) version of QueryWeb with more than 50 languages (at the 99th percentile). We collect this dataset to validate the effectiveness of multilingual pre-training for zero-shot generalization across different languages.

Experimental Details
We use the BERT-multilingual vocabulary (Devlin et al., 2018) to tokenize the serialized OCR words. We have two variants of QueryForm: a 6-layer ETC with hidden size 512 and 8 attention heads, and a 12-layer ETC with hidden size 768 and 12 attention heads. For both S-Prompts and E-Prompts generated from dataset annotations, we use a maximum token length of 32 with zero padding. For the learnable S-Prompt used in the fine-tuning stage, we treat its token length as a hyperparameter to be searched.
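The fixed-length prompt preprocessing described above amounts to simple truncation and zero padding. A small sketch (token-id conventions are our assumptions for illustration):

```python
# Sketch of prompt preprocessing: annotation-generated prompts are
# truncated or zero-padded to a fixed token length of 32.

MAX_PROMPT_LEN = 32
PAD_ID = 0  # assumed padding token id

def pad_prompt(token_ids, max_len=MAX_PROMPT_LEN, pad_id=PAD_ID):
    ids = token_ids[:max_len]                     # truncate long prompts
    return ids + [pad_id] * (max_len - len(ids))  # zero-pad short ones

ids = pad_prompt([101, 2023, 2003, 102])  # hypothetical WordPiece ids
assert len(ids) == 32 and ids[:4] == [101, 2023, 2003, 102] and ids[4] == 0
```

The learnable S-Prompt used in fine-tuning skips this step entirely, since its vectors live directly in the embedding space and its length is searched as a hyperparameter.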
Our method uses the proposed QueryWeb pre-training approach on the 2 large-scale webpage-based datasets. Other competing methods use MLM pre-training with the corresponding datasets mentioned in their papers, including ∼0.7k unlabeled form documents for ETC+RichAtt (Ainslie et al., 2020), the IIT-CDIP dataset (Lewis et al., 2006) with 7M documents for FormNet (Lee et al., 2022), 30M multilingual documents for LayoutXLM (Xu et al., 2021), and 2.5TB of multilingual CommonCrawl data for XLM-RoBERTa (Conneau et al., 2019) and InfoXLM (Chi et al., 2020). We use micro-F1 to evaluate performance on XFUND-related experiments, following Xu et al. (2021), and macro-F1 to evaluate Payment-related experiments, following Majumder et al. (2020) and Lee et al. (2022). We report the mean of the best experiment results across three runs with different seeds. Please see Appendix B for more experimental details.
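The difference between the two metrics above can be made explicit with a short sketch (our illustration; entity counts are made up): micro-F1 pools true/false positives and false negatives across entity types, while macro-F1 averages the per-type F1 scores.

```python
# Sketch contrasting micro-F1 (used for XFUND) and macro-F1 (used for
# Payment). counts maps each entity type to (tp, fp, fn).

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(counts):
    # Micro: pool counts over all entity types, then compute one F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    # Macro: compute F1 per entity type, then average.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro

counts = {"total_amount": (8, 2, 0), "due_date": (1, 0, 3)}  # toy numbers
micro, macro = micro_macro_f1(counts)
print(round(micro, 3), round(macro, 3))  # 0.783 0.644
```

Macro-F1 weights rare and frequent entity types equally, whereas micro-F1 is dominated by the frequent ones, so the two benchmarks are not directly comparable.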

Zero-shot Transfer Learning Results
To evaluate QueryForm, we introduce two zero-shot transfer learning tasks and one few-shot learning task, as shown in Table 1. We follow the official train-test split for all publicly available datasets by default, unless specified explicitly. For zero-shot evaluation on the Payment test set, we pre-train on QueryWeb and fine-tune on Inventory.

FUNSD-XFUND. In Table 3, we compare QueryForm with recent zero-shot transfer learning methods, including XLM-RoBERTa (Conneau et al., 2019), InfoXLM (Chi et al., 2020), and the current state-of-the-art LayoutXLM (Xu et al., 2021).

Inventory-Payment. In Table 4, we compare QueryForm with the current state-of-the-art on Payment, FormNet (Lee et al., 2022), and our baseline ETC+RichAtt (see Section 3.2). QueryForm outperforms competing methods by a significant margin. Although FormNet obtains the best supervised upper-bound result on Payment, its lower zero-shot results indicate that knowledge transfer across document types remains very challenging.
QueryForm is expected to take advantage of a larger number of queries even when they are less relevant. To validate this, we compare fine-tuning datasets with 7 and 28 annotated entities. As can be seen, supervised methods like FormNet and ETC+RichAtt suffer a performance drop when seeing additional entities that do not exist in the target dataset, while QueryForm gains a further performance improvement.

Ablation study. We conduct an ablation study of QueryForm. From the results in Table 5, we can see that both our dual prompting strategy and QueryWeb pre-training contribute to the zero-shot F1 score individually, and synergistically improve performance when combined.

Few-shot Learning on Payment
In practice, it is reasonable to believe that a few annotated documents from the target document type can help models adapt quickly. Therefore, we design a few-shot learning step based on the best-performing model obtained on the Inventory dataset.
Table 6 shows the 1- and 10-shot results on Payment. To make sure training is stable in the low-data regime and the comparison is fair, we conduct hyperparameter search (e.g., learning rates, number of frozen layers) for all methods and report the best-performing configurations. When fine-tuning on the extreme Payment 1-shot setting, both FormNet and ETC+RichAtt severely overfit the single Payment document, while QueryForm maintains high performance. When extending to 10-shot, all methods improve, and QueryForm still performs the best. Although FormNet is state-of-the-art in the supervised learning setting, it underperforms other methods in the low-data regime. We hypothesize its GCN requires more data to learn layout features.

Result analysis
Prediction visualization. Figure 5 demonstrates an example output of QueryForm. QueryForm infers entities that are annotated as "others" in the ground truth as one of the other three entity types with concrete meanings. For example, "Type" is a question in the form, but one without a corresponding answer. Although human annotators might find it ambiguous and mark it as "others", QueryForm successfully recognizes it as a "question".

Loss visualization. Figure 6 shows the loss curves of pre-training on QueryWeb (left) and fine-tuning on Inventory (right). From the pre-training loss curve, we observe that the loss converges well despite the fact that the weak supervision extracted from webpages is often noisy. Moreover, from the fine-tuning curve, we observe that the loss converges very fast, thanks to the knowledge learned during pre-training. These observations indicate that our framework successfully extracts useful information from the weak supervision and leverages it to facilitate fine-tuning on form-like documents.

Conclusion
This paper presents QueryForm, a novel framework to address the challenging zero-shot document entity extraction problem. The dual prompting design in QueryForm offers a refreshing view that unifies the pre-training and fine-tuning objectives, allowing us to leverage large-scale form-like webpages with HTML tags as weak annotations. QueryForm sets new state-of-the-art results on multiple zero-shot DEE benchmarks. We believe QueryForm serves as a flexible framework for document understanding tasks, and multiple interesting directions could be further explored within it, such as prompt design and richer pre-training sources.

Ethical and Broader Impact
We have read the ACL Code of Ethics and ensure that our work conforms to this code. As a novel framework for zero-shot document entity extraction, QueryForm has great potential to boost the performance of existing DEE systems. However, we would still like to discuss its limitations and risks to avoid any misuse of QueryForm.
Although our proposed QueryWeb pre-training approach can effectively transfer knowledge from publicly available webpages to form-like documents, it inevitably carries bias and fairness problems (Mehrabi et al., 2021) over to the downstream task. Therefore, in real-world applications, stricter rules should be applied to filter and clean the webpages, and the bias and fairness issues of the pre-trained model should be checked thoroughly.

Limitations
In addition to the bias and fairness concerns that we discussed in the Ethical and Broader Impact section, we discuss the possible limitations of our method in this section.
As a query-based DEE framework, QueryForm may be prone to specific prompting-based adversarial attacks (Xu et al., 2022), which may further pose potential security concerns for safety-critical documents. Thus, it is important to test the robustness of QueryForm against adversarial attacks and design defense schemes to further strengthen our method in the future.
Our work focuses on the closed-world setting, in which the source documents include the entities contained in the target documents, following Xu et al. (2021), without further investigating the possible open-world setting (Shu et al., 2018) with unseen test entities. However, as a query-based framework that makes conditional predictions without a pre-defined set of entities, QueryForm in principle supports the prediction of unseen entities at test time, and we leave this as an interesting future research direction.

Figure 1: Illustration of the zero-shot transfer learning stages of QueryForm. In the pre-training stage, we extract millions of schemas and entity-value pairs from publicly available webpages to generate a large amount of query-value pairs that teach the backbone model to make query-conditional predictions. During fine-tuning, we extract more accurate entity-value pairs from the available annotated documents and directly learn schema information from data. Finally, we evaluate the pre-trained model on a different target document type without training data.

Figure 2: Form-like examples of webpage and invoice documents. Webpages appear to have distinct layouts and contents from invoice documents, but both contain rich entity-value pairs, such as "page title - The 61st Annual Meeting of the Association for Computational Linguistics" in the webpage and "total amount - $755" in the invoice.

Figure 3: Overview of QueryForm. Our dual prompting design yields a consistent objective in both the pre-training and fine-tuning stages. Note that the schema query in pre-training comes from website domains, while it is a learnable parameter in fine-tuning. See Section 4 for more details.

QueryWeb: Webpage-based Pre-training

Distinct from recent work that focuses on multimodal pre-training, our proposed pre-training approach provides a new perspective with two core ideas: (1) aligning pre-training and fine-tuning objectives, and (2) utilizing easily accessible and informative webpages.

Figure 4: An example HTML snippet with two entities, product/name and product/price.

Figure 5: Visualization example from XFUND (French). QueryForm labels entities with the ambiguous "others" annotation in the ground truth as one of the other three entity types with concrete meanings.

Figure 6: Loss curves of pre-training on QueryWeb (left) and fine-tuning on Inventory (right).

Table 1: Experiment design of the two zero-shot transfer learning tasks and one few-shot learning task.

Table 2: Detailed statistics of the datasets used.

Table 4: Comparison between QueryForm and previous state-of-the-art methods on the Inventory-Payment zero-shot benchmark. I-7 and I-28 are abbreviations of Inventory-7 and Inventory-28, respectively. QueryForm has much better generalization ability, as indicated by its stronger zero-shot performance and smaller gap to its supervised upper bound (trained and tested both on Payment).

Table 5: Ablation study of QueryForm on the Inventory-Payment benchmark. E-P and S-P are abbreviations of E-Prompt and S-Prompt, respectively.

Table 6: Comparison between QueryForm and competing methods further fine-tuned on few-shot Payment training sets.