\textrm{DuReader}_{\textrm{vis}}: A Chinese Dataset for Open-domain Document Visual Question Answering

Open-domain question answering has been used in a wide range of applications, such as web search and enterprise search, and usually takes clean texts extracted from various formats of documents (e.g., web pages, PDFs, or Word documents) as the information source. However, designing different text extraction approaches is time-consuming and not scalable. In order to reduce human cost and improve the scalability of QA systems, we propose and study an \textbf{Open-domain} \textbf{Doc}ument \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering (Open-domain DocVQA) task, which requires answering questions directly from a collection of document images rather than from document texts alone, additionally utilizing layouts and visual features. Towards this end, we introduce the first Chinese Open-domain DocVQA dataset, \textrm{DuReader}_{\textrm{vis}}, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. There are three main challenges in \textrm{DuReader}_{\textrm{vis}}: (1) long document understanding, (2) noisy texts, and (3) multi-span answer extraction. Extensive experiments demonstrate that the dataset is challenging. Additionally, we propose a simple approach that incorporates the layout and visual features, and the experimental results show its effectiveness. The dataset and code will be publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-vis.


Introduction
Open-domain Question Answering (Open-domain QA) is a task that requires answering questions based on a collection of document texts. It has been used in a wide range of applications, such as web search (He et al., 2018; Nguyen et al., 2016; Chen et al., 2017a), enterprise QA (Castelli et al., 2020), biomedical QA (Levy et al., 2021), etc.
The typical procedure of an open-domain QA system is summarized in Figure 1(a). It first needs to design specific text extraction methods for real-world documents in different formats (e.g., PDFs, web pages, scanned documents, etc.) and extract certain text contents from them (e.g., the main body of web pages). Since there is no universal method for text extraction, it is expensive to build a unified QA system that can process documents in different formats as the information source. This greatly limits the scalability of QA systems, where a scalable QA system should process various formats of documents at a low cost and not be restricted by the document format. In addition, the visual layouts (e.g., font size, list format, and table format) and the visual features (e.g., text color, pictures, and figures) are lost after text extraction, although they are of great significance for comprehending documents.
To tackle the above limitations, we propose and study an Open-domain Document Visual Question Answering (Open-domain DocVQA) task, which takes a collection of document images (converted from real-world documents) as the information source to answer questions, as shown in Figure 1(b). In this task, we apply a universal document extractor (e.g., OCR) to extract all the texts and layouts from the document images, and then utilize them along with the visual features in the following procedures: Document Visual REtriever (DocVRE) to retrieve relevant document images, and Document Visual Question Answering (DocVQA) to extract answers from the retrieved document images. The open-domain DocVQA task encourages us to design an open-domain QA system that can be applied to various data sources in a scalable way, leveraging text, layout, and visual information simultaneously.
In open-domain QA, it is intuitive to build the corresponding datasets from those for machine reading comprehension (which requires answering questions based on one or a few documents), e.g., Natural Questions Open (Kwiatkowski et al., 2019) and SQuAD-open (Chen et al., 2017b). With the development of document intelligence research, several datasets for visual machine reading comprehension (or question answering) have been created, such as VisualMRC (Tanaka et al., 2021), InfographicVQA (Mathew et al., 2021a) and DocVQA (Mathew et al., 2021b). However, their questions are collected in a crowd-sourced way rather than from real users' information-seeking questions, which makes them unsuitable for Open-domain DocVQA research. Besides, most document images in existing datasets are short documents with simple layouts and few visual features, whereas open-domain scenarios often involve more complex documents. Furthermore, the answers in existing datasets are mainly short (e.g., entities, numbers, etc.); in contrast, real applications have longer answers in various formats such as paragraphs, lists, and tables. In addition to the above limitations, there are, to the best of our knowledge, very few Chinese datasets.
To deal with the limitations above, we introduce DuReader vis, the first Chinese open-domain DocVQA dataset, to promote studies in Open-domain DocVQA. Specifically, we collect questions and document images from Baidu Search. The experimental results show that there are three main challenges (Section 5.4.1) in DuReader vis: 1) long document understanding, where the document images are converted from long documents with rich visual features and complex layouts; 2) noisy texts, such as advertisements and related links in web pages, which increase the difficulty of understanding the documents; and 3) multi-span answer extraction, where the actual answers can be multi-span texts, lists, and tables. Furthermore, an additional zero-shot study (Section 5.4.2) on real-world documents in different formats (including PDFs, Word documents, and scanned images) demonstrates the scalability of our approach and the good transferability of models trained on DuReader vis.
Our main contributions are as follows:
• We propose and study an open-domain DocVQA task to encourage developing open-domain QA systems that can be applied to various data sources in a scalable way, without expensive, format-specific text extraction efforts.
• We introduce the first Chinese open-domain DocVQA dataset DuReader vis with three main challenges: long document understanding, noisy texts, and multi-span answer extraction.
Related Work

Open-domain Question Answering
In previous works, a two-stage approach is usually used to solve the open-domain QA task, i.e., a document retrieval stage with BM25 (Chen et al., 2017b) or dense retrieval (Karpukhin et al., 2020a; Qu et al., 2021a), and a document question answering stage with a machine reading comprehension model (Karpukhin et al., 2020a; Mao et al., 2020). However, as mentioned in Section 1, the specific text extraction methods make real open-domain QA applications hard to scale, and lose layouts and visual features that may be necessary for document understanding.

Document Visual Question Answering
Document Visual Question Answering (DocVQA) is a task of answering questions based on a given real-world document image. In DocVQA (Mathew et al., 2021b), document images are collected from the Industry Documents Library, covering different document types like tables, forms, and figures, while the answers are mainly entities and numbers. VisualMRC (Tanaka et al., 2021) is an abstractive DocVQA task, where each document image is a small part of a Wikipedia web page. InfographicVQA (Mathew et al., 2021a) focuses on elementary reasoning skills such as counting, sorting, and arithmetic operations. Nevertheless, the questions in these datasets are not information-seeking questions (Dasigi et al., 2021) from real users but are generated by annotators who have seen the documents, making these datasets unsuitable to be extended into open-domain DocVQA datasets. Besides, these datasets have few document images from long documents with complex layouts, and their answers are mainly short answers such as entities and numbers.
In comparison, we focus on building a new dataset, DuReader vis, which consists of (i) real questions from real-world users; (ii) long document images; and (iii) long annotated answers with various answer types and multi-span answers. A detailed comparison between DuReader vis and existing DocVQA datasets is shown in Table 1.

DuReader vis
This section defines the task formally, then describes the data collection and annotation process, and finally presents statistics and analysis.

Task Overview
DuReader vis is a Chinese dataset for Open-domain DocVQA. Given a collection of document images Ī as the information source, a system is asked to extract one or multiple text spans from Ī as the answer A to the question Q. The task contains two stages: 1) the Document Visual Retrieval (DocVRE) stage, which retrieves relevant document images Î that may answer the question Q from the whole document image collection Ī (|Ī| ≫ |Î|); and 2) the Document Visual Question Answering (DocVQA) stage, which extracts the answer A from the relevant document image set Î.
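To make the two-stage formulation concrete, the following is a minimal sketch of the pipeline interface; all class and function names (DocumentImage, answer_question, retriever.retrieve, reader.extract_spans) are illustrative assumptions rather than part of any released codebase.

```python
# Minimal sketch of the two-stage Open-domain DocVQA pipeline described above.
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentImage:
    doc_id: str
    tokens: List[str]       # OCR tokens d_0 ... d_n
    boxes: List[tuple]      # rectangular boxes b_i = (x0, y0, x1, y1)
    image_path: str

def answer_question(question: str, retriever, reader, top_k: int = 1) -> List[str]:
    """Two-stage Open-domain DocVQA: DocVRE followed by DocVQA."""
    # Stage 1 (DocVRE): retrieve the top-k relevant document images from the
    # whole collection held by the retriever.
    relevant: List[DocumentImage] = retriever.retrieve(question, top_k=top_k)
    # Stage 2 (DocVQA): extract one or more answer spans from the retrieved images.
    spans: List[str] = []
    for doc in relevant:
        spans.extend(reader.extract_spans(question, doc))
    return spans
```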

Data Collection and Annotation
This subsection describes the data collection, the annotation procedure and the quality control during annotations.

Question Collection
We randomly sample 40K queries from the search log of Baidu and then apply a pre-trained question classifier (with precision and recall higher than 92%) to filter out non-question queries, leaving about 18K queries. Then, we ask annotators to further filter out pornography- or violence-related questions. Eventually, we retain about 16K questions.

Document Image Collection
After question collection, we need to collect document images for the DocVRE and DocVQA stages of our task. For the DocVQA stage, for each of the collected 16K questions we take whole-page screenshots of the top-4 web pages in the Baidu search results (dropping unavailable web pages) with the open-source tool Puppeteer, and use them as the document images for answer annotation. Then, to build a larger document collection for the DocVRE stage, we randomly sample more document images in the same way through other insensitive queries. Finally, there are about 158K document images in our collection.

Answer Annotation and Quality Control
Finally, we annotate answers for the collected 16K questions and their relevant document images. Each annotated sample consists of a question, one of its relevant document images, and the corresponding document URL. The annotator must extract the answer text and mark the answer type. A sample is removed if the text content in the document image does not contain the correct answer.
There are three answer types: text, list, and table. If the answer is of the list type or the table type, the annotators must annotate all the list items or table cells that can answer the question. Finally, after filtering out questions with no annotated answers, we obtain 15K question-answer pairs, with 14K unique questions.
To ensure data quality, we perform the annotation on an internal annotation platform, where all the annotators and reviewers are full-time employees and native speakers. The data samples are divided into packages during annotation, with 1000 samples each. For a single package, the annotators extract the answers first. Then at least two reviewers check the accuracy of the package by independently reviewing 100 random samples. If the average accuracy is below the threshold (i.e., 93%), the annotators are asked to revise the answers until the accuracy is higher than the threshold.

Statistics and Analysis
In this subsection, we analyze the statistical features of DuReader vis. DuReader vis has 14K unique questions, 158K document images, and 15K question-answer pairs in total. We randomly split the samples into 11K, 1.5K, and 2.5K question-answer pairs for the training, development, and test sets, respectively. We provide questions, document images, answers, and document URLs in our dataset and make the dataset public only for research purposes. An example question-answer pair is shown in Figure 2.

Document Images
As shown in Table 2, the average length of the textual contents in the document images of DuReader vis is 1968.21, which is significantly longer than DocVQA (182.75), VisualMRC (151.46), and InfographicVQA (217.89). Modeling such a long sequence is difficult for many pre-trained language models (e.g., Devlin et al. 2019; Liu et al. 2019) due to their limited input length (usually no more than 512 tokens), making this the first challenge in DuReader vis.
In addition, the document images in DuReader vis come from over 17K random websites and are thus diverse in topics and document layouts. With such long documents, rich visual features, and complex layouts, noise in the samples is inevitable, making it the second challenge in DuReader vis.

Questions and Answers
Existing DocVQA datasets contain mostly factoid questions, with mainly short entities and numbers as answers. In comparison, DuReader vis contains both factoid and non-factoid questions.
To demonstrate the diversity of question types in DuReader vis, we randomly check 200 questions and classify each as factoid or non-factoid; the results show that 43% of the questions are non-factoid. Besides, the answers in DuReader vis are more complex. In fact, only 40% of the answers are normal text. There are 25% list answers and 35% table answers, which are likely to be discontinuous and thus have to be modeled as multiple spans. As shown in Table 2, the average length of the answers in DuReader vis is 180.54, undoubtedly longer than the factoid answers in DocVQA (2.43), VisualMRC (9.53), and InfographicVQA (1.60). The long and multi-span answers make this the third challenge in DuReader vis.

Proposed Model
In this section, we propose a simple baseline for DuReader vis. The approach contains three parts: 1) the Universal Document Extractor, which obtains textual contents, layout, and visual information as the input; 2) the Document Visual REtriever (DocVRE), which retrieves relevant documents; and 3) the Document Visual Question Answering (DocVQA) model, which extracts answers from the retrieved documents.

Universal Document Extractor
Different formats or sources of documents require different content extractors. For example, we need to write different crawlers and parsers to extract texts from documents on different websites. The content extractors for PDFs and Word documents are also not universal across different sources and tasks. Moreover, contents from scanned documents and images can only be extracted by OCR. To extract contents from various formats of documents in a more universal and scalable way, we directly convert all formats of documents into document images and adopt OCR to obtain the texts and layouts. Given a document image I, we first parse the document image with an OCR engine to obtain the textual document D = {d_0, d_1, ..., d_i, ..., d_n} and the rectangular bounding boxes B = {b_0, b_1, ..., b_i, ..., b_n}, where n is the document length, d_i is the i-th token in the document, and b_i = (x^i_0, y^i_0, x^i_1, y^i_1) denotes the left, top, right, and bottom positions of the i-th token's boundary.
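As an illustration of the universal extractor, here is a minimal sketch using the PaddleOCR package (the engine used in our experiments; see Section 5.2), where detected quadrilaterals are collapsed into rectangular boxes b_i = (x0, y0, x1, y1). The exact result structure varies across PaddleOCR versions, and the character-level tokenization is an assumption, so treat this as a sketch rather than a drop-in implementation.

```python
# A minimal sketch of the universal document extractor: convert any document to
# an image, then run OCR to obtain tokens and rectangular bounding boxes.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="ch")  # Chinese model for DuReader_vis pages

def extract(image_path):
    tokens, boxes = [], []
    lines = ocr.ocr(image_path)[0]          # one entry per detected text line
    for quad, (text, _score) in lines:
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        # Collapse the detected quadrilateral into a rectangle
        # b_i = (x0, y0, x1, y1): left, top, right, bottom.
        box = (min(xs), min(ys), max(xs), max(ys))
        for tok in text:                     # character-level tokens (illustrative)
            tokens.append(tok)
            boxes.append(box)                # share the line box across its tokens
    return tokens, boxes
```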

Document Visual Retriever
DocVRE aims to retrieve relevant documents from an extensive collection of documents. In this paper, we use the text contents of the document images to build the retrieval index and use BM25 (Robertson and Zaragoza, 2009) to retrieve relevant documents.
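A minimal DocVRE sketch is shown below, assuming the third-party rank_bm25 package and a simple character-level tokenizer for Chinese; both choices are illustrative and may differ from the exact setup used in our experiments.

```python
# BM25 retrieval over the OCR text of each document image.
from rank_bm25 import BM25Okapi

def tokenize(text):
    return list(text)                         # character-level tokens for Chinese

class BM25Retriever:
    def __init__(self, docs):                 # docs: list of (doc_id, ocr_text)
        self.doc_ids = [d[0] for d in docs]
        self.bm25 = BM25Okapi([tokenize(d[1]) for d in docs])

    def retrieve(self, question, top_k=5):
        scores = self.bm25.get_scores(tokenize(question))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.doc_ids[i] for i in ranked[:top_k]]
```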

Document Visual Question Answering
DocVQA aims to extract answers from the documents returned by DocVRE, which involves two challenges: long document understanding and multi-span answer extraction. For long document understanding, we utilize a Hierarchical LayoutXLM (Xu et al., 2021) (Hi-LayoutXLM) to model interactions within documents using text, layout, and visual information. Then, we extract multi-span answers with a sequence labeling method based on a CRF (Conditional Random Field). The DocVQA model includes three stages: the paragraph encoders, the document encoder, and the answer extractor, as shown in Figure 3.
Paragraph Encoders: We take LayoutXLM as the paragraph encoder, which accepts text, layout, and visual features as inputs. Due to the input length limitation of LayoutXLM, we split the document and the bounding boxes (D, B) into m groups, from (D_1, B_1) to (D_m, B_m). Then, for each group (D_j, B_j), we feed it along with the question Q and the whole document image I into the same LayoutXLM to get the hidden representations H_j of each token in D_j, where the LayoutXLM encodes the concatenation of the question Q and the paragraph D_j.
Document Encoder: The document encoder further encodes the document by combining the paragraph representations from all paragraph encoders, modeling interactions across the whole document image (see Figure 3).
Answer Extractor: Finally, we apply a CRF layer in the answer extractor to label all the answer spans with the "BIOES" label format. Similar to Named Entity Recognition (NER), "B" denotes the first token of an answer span, "I" denotes subsequent tokens inside the answer span, "E" denotes the end of the answer span, and "O" denotes tokens outside answer spans. Besides, if there is only one token in the answer span, it is labeled as "S". The Viterbi algorithm (Viterbi, 1967) is adopted to decode the tag sequence with the highest probability. An example of a multi-span answer is shown in Figure 5 in Appendix A.2.
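To make the BIOES labeling concrete, the following sketch encodes (possibly discontinuous) gold answer spans into token labels and decodes predicted labels back into multi-span answers; it is a self-contained illustration, not the exact training code.

```python
# BIOES encoding/decoding for multi-span answer extraction.
def spans_to_bioes(n_tokens, spans):
    """spans: list of (start, end) token indices, end inclusive."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            labels[start] = "S"
        else:
            labels[start] = "B"
            for i in range(start + 1, end):
                labels[i] = "I"
            labels[end] = "E"
    return labels

def bioes_to_spans(labels):
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "S":
            spans.append((i, i))
        elif lab == "B":
            start = i
        elif lab == "E" and start is not None:
            spans.append((start, i))
            start = None
        elif lab == "O":
            start = None
    return spans

# Example: two answer spans over a 10-token document.
print(spans_to_bioes(10, [(2, 4), (7, 7)]))
# ['O', 'O', 'B', 'I', 'E', 'O', 'O', 'S', 'O', 'O']
```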

Experiment
In this section, we first describe the experiments we conduct on the DuReader vis dataset and then provide further analysis and discussion. Case studies are shown in Appendix A.2.

Experimental Setup
In this subsection, we describe the experimental baselines and evaluation metrics.

Baselines
For DocVRE, we use BM25 to retrieve relevant document images as the baseline. For DocVQA, we plug two text-based pre-trained models (RoBERTaXLM-base (Liu et al., 2019) and BERT-base-Chinese (Devlin et al., 2019)) into our proposed framework as baselines, where we only use the textual contents as the input and replace LayoutXLM with the text-based pre-trained models.

Evaluation Metrics
For DocVRE, we evaluate the retrieval results by Recall@5, Recall@10, and MRR. For DocVQA and Open-domain DocVQA, we concatenate all answer spans together and use Rouge-L and F1 to evaluate the answer extraction performance. The details are given in Appendix A.1.

Implementation Details
We utilize PaddleOCR to parse document images. The OCR results, containing texts and bounding boxes, are sorted by line. We evaluate PaddleOCR on our dataset and obtain an F1 above 90, which shows that OCR is not the bottleneck of our task. For the DocVQA baselines, we set the maximum paragraph number to 8, meaning the maximum document length is about 4000 tokens; documents with more tokens are truncated. Hi-LayoutXLM has about 200M parameters. We train for 10 epochs using the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 3e-5.
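The following is a hedged sketch of this setup (documents truncated to at most 8 paragraphs of about 512 tokens each, AdamW with a learning rate of 3e-5, 10 epochs); the model construction and training loop are only indicated, since the exact implementation is not shown here.

```python
# Illustrative document chunking and optimizer setup for the DocVQA baselines.
import torch

MAX_PARAGRAPHS = 8
PARAGRAPH_LEN = 512          # per-paragraph input length for the paragraph encoder

def chunk_document(tokens, boxes):
    """Split OCR output into at most MAX_PARAGRAPHS groups of PARAGRAPH_LEN tokens."""
    groups = []
    for i in range(0, len(tokens), PARAGRAPH_LEN):
        groups.append((tokens[i:i + PARAGRAPH_LEN], boxes[i:i + PARAGRAPH_LEN]))
        if len(groups) == MAX_PARAGRAPHS:
            break                # longer documents are truncated
    return groups

def build_optimizer(model):
    return torch.optim.AdamW(model.parameters(), lr=3e-5)

# Training skeleton: 10 epochs over the DuReader_vis training set.
# for epoch in range(10):
#     for batch in train_loader:
#         loss = model(**batch)   # Hi-LayoutXLM forward with CRF loss
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```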

Main Results
The results of DocVRE, DocVQA, and Open-domain DocVQA (DocVRE+DocVQA) are shown in Table 4, Table 3, and Table 5, respectively.
DocVRE: From Table 4, we can see that BM25 obtains decent retrieval performance; the top-1 document is used for DocVQA in the Open-domain QA setting.
DocVQA: As shown in Table 3, Hi-LayoutXLM performs best among all baselines, which indicates that the layout and visual features are beneficial for understanding document images. The results also show that there is still a performance gap between the baseline models and human performance. In addition to the overall performance, we also report the performance of the baselines on each answer type. All the models obtain better results on the list type, since list items commonly have indicators like numbers and have similar layouts in the document, helping models gain further improvements (also shown in Figure 2). The performance gap between Hi-RoBERTaXLM and Hi-LayoutXLM on the table-type and list-type answers is bigger, since lists and tables have rich layout and visual information, which is conducive to list and table understanding. In comparison, the text-type answers focus more on understanding the text content, where the layout and visual information provide fewer benefits, so the performance gap is smaller.
Open-domain DocVQA: The experimental results of Open-domain DocVQA are shown in Table 5. Since DocVRE cannot perform perfectly, the performance of our model on open-domain DocVQA decreases compared to that on DocVQA. Compared to "BM25+Hi-BERT" and "BM25+Hi-RoBERTaXLM", our method also performs better, with the same observations as in the DocVQA results.

Analysis and Discussion
In this subsection, we perform further analysis to demonstrate the three challenges in DuReader vis, the scalability of our approach, and the performance gap between our approach and open-domain QA. Finally, we present an error case study and give some promising future directions.

Challenges in DuReader vis
As mentioned above, our dataset has three main challenges: 1) long document understanding, 2) noisy texts, and 3) multi-span answer extraction.
In this subsection, we analyze their influence on DocVQA.
Long Document Understanding: We conduct a statistical analysis of the relationship between model performance and document length on the development set of DuReader vis using Hi-LayoutXLM. As shown in Figure 4, documents in DuReader vis mainly have 500 to 3000 tokens. As the document length increases, the model performance gradually decreases: longer documents are harder to understand, and it is harder for models to extract the information needed to answer questions.
Noisy Texts: Documents from web pages commonly contain much noise, such as advertisements, related recommendations, etc. It is hard to accurately distinguish main contents from noise, so we roughly remove noise with a heuristic algorithm and compare the models with and without denoising. After removing noise, there is a performance improvement (F1: 53.10 → 57.24 (+4.14), and Rouge-L: 50.94 → 54.83 (+3.89)). We attribute this to the fact that noisy texts increase the total amount of information in the document, and it is easier for models to focus on the valuable information after the noise is removed.
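The denoising heuristic itself is not detailed above; as a purely hypothetical illustration, a rule-based filter over OCR lines might look like the following (the noise patterns and length threshold are invented for illustration only).

```python
# Hypothetical rule-based denoising of web-page OCR text:
# drop very short lines and lines matching common boilerplate/advertisement markers.
import re

# Hypothetical noise markers; a real list would be tuned on the data.
NOISE_PATTERNS = re.compile(r"(广告|相关推荐|点击查看|下载APP|版权所有|登录|分享)")

def denoise(ocr_lines, min_len=4):
    """ocr_lines: list of text lines from OCR, in reading order."""
    kept = []
    for line in ocr_lines:
        text = line.strip()
        if len(text) < min_len:          # likely menu items or stray characters
            continue
        if NOISE_PATTERNS.search(text):  # likely ads, related links, page chrome
            continue
        kept.append(text)
    return kept
```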
Multi-span Answer Extraction: In this part, we perform experiments to verify that multi-span answer extraction is more challenging than single-span answer extraction. We convert the multi-span extraction task into a single-span extraction task by concatenating all answer spans together and inserting the concatenated answer at the position of the first answer span, and then compare the two settings. Compared with multi-span answer extraction, single-span answer extraction performs better (F1: 53.10 → 59.38 (+6.28), and Rouge-L: 50.94 → 58.08 (+7.14)), which indicates that the multi-span task format is harder to model; without multi-span answers, the task would be easier. A minimal sketch of this conversion is shown below.
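```python
# A sketch of the multi-span-to-single-span conversion used for this comparison:
# concatenate all gold answer spans and place the concatenation at the position of
# the first span. Token indices are illustrative.
def to_single_span(tokens, spans):
    """tokens: document tokens; spans: list of (start, end) indices, end inclusive."""
    spans = sorted(spans)
    answer = [t for s, e in spans for t in tokens[s:e + 1]]
    first_start = spans[0][0]
    # Remove the original spans (from the back, to keep earlier indices valid) ...
    remaining = list(tokens)
    for s, e in reversed(spans):
        del remaining[s:e + 1]
    # ... then insert the concatenated answer where the first span began.
    new_tokens = remaining[:first_start] + answer + remaining[first_start:]
    new_span = (first_start, first_start + len(answer) - 1)
    return new_tokens, new_span
```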

Zero Shot Study
In this subsection, we randomly select 100 questions (which do not occur in DuReader vis) from Baidu and obtain the most relevant documents from a large collection of real-world documents in different formats (including PDFs, Word documents, and scanned images), to evaluate how well models trained on DuReader vis transfer to these formats.

Comparison with Open-domain QA
The goal of open-domain DocVQA is to develop a more scalable QA system that can be applied to diverse domains and document formats. In Section 5.4.2, we have shown that our approach can be applied to various formats and achieve decent performance. In the future, we aim to achieve competitive (or even better) performance compared to well-designed format-specific or task-specific QA systems.
In this subsection, we design an experiment to measure the performance gap between our scalable open-domain DocVQA system (as shown in Figure 1(b)) and a well-designed format-specific open-domain QA system that extracts text contents from web pages with well-designed text extractors (Figure 1(a)). The results on DuReader vis are shown in Table 6. Open-domain QA performs better for two reasons: 1) the textual contents are clean, with little web-page noise; and 2) the extracted contents only contain clean text, making the whole input shorter. The results show that there is still room for open-domain DocVQA to improve. It is of great value for researchers to push open-domain DocVQA to obtain competitive results with task-specific or format-specific methods, in order to reduce task-specific or format-specific efforts.

Error Case Study and Future Directions
We randomly sample and manually analyze 100 error cases with Rouge-L lower than 0.5 from the prediction results. 55% of the errors occur in the retrieval stage, caused by the purely lexical matching of BM25, which does not capture semantics. In addition, about 15% of the samples have multi-span answers, about 15% make almost completely wrong predictions, and about 5% output no answer. Furthermore, about 10% of the samples suffer from noisy texts.
From these results, we suggest some promising directions for improvement: 1) utilize multi-modal information to model long document images and automatically reduce the impact of noise; 2) utilize dense retrieval methods (Karpukhin et al., 2020b; Qu et al., 2021b; Ren et al., 2021) to improve document retrieval; 3) pre-train multi-modal language models for long document images; and 4) develop better methods to extract multi-span answers.

Conclusion
We propose an open-domain document visual question answering task to encourage scalable QA applications, and introduce DuReader vis to move toward open-domain DocVQA research. The dataset poses three challenges: long document understanding, noisy texts, and multi-span answer extraction. We propose a baseline, and the results show that there is still a large gap compared to human performance. We demonstrate the scalability of our approach with a zero-shot study. Finally, we present error cases and future directions.

A.1 Evaluation Metrics
We describe the metrics used in our experiments in detail.
Recall@K: Recall@K is calculated as the proportion of questions for which the top-k retrieved document images contain the answers.
MRR: Mean Reciprocal Rank (MRR) is the average, over all questions, of the reciprocal rank of the first retrieved relevant document image.
Rouge-L: Rouge-L uses an LCS-based (Longest Common Subsequence-based) F-measure to estimate the similarity between the reference X of length m and the prediction Y of length n.
F1: F1 measures the average overlap between the prediction and the reference, where we treat them as bags of tokens and compute their F1. We report the average of the F1 values over all questions.
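As a reference sketch, the answer-level metrics can be computed as follows; the tokenization and the Rouge-L beta parameter are assumptions, and the official evaluation scripts may differ.

```python
# Token-overlap F1 and an LCS-based Rouge-L F-measure over token sequences.
from collections import Counter

def f1_score(prediction_tokens, reference_tokens):
    common = Counter(prediction_tokens) & Counter(reference_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def rouge_l(prediction_tokens, reference_tokens, beta=1.2):
    m, n = len(reference_tokens), len(prediction_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]          # LCS dynamic program
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference_tokens[i - 1] == prediction_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    rec, prec = lcs / m, lcs / n
    # F_lcs = (1 + beta^2) * P * R / (R + beta^2 * P)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```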

A.2 Case Study
Here we give some results to show the performance of our proposed baseline, as shown in Figures 5, 6, and 7. In Figure 5, the answer consists of multiple table cells that are not contiguous. We can see that our model predicts the right answer by utilizing the layout information, while Hi-RoBERTaXLM cannot predict any answer, since the information in the page is difficult for Hi-RoBERTaXLM to model.
In Figure 6, the answer has salient layout and visual information (the font is large and the color is red). It is easy for our model to capture, while it is hard for Hi-RoBERTaXLM to predict the right answer.
In Figure 7, the answer is a single-span text, and the question is highlighted in the document image. Hi-RoBERTaXLM also predicts an answer, but the answer is incomplete compared with the ground truth; the answer predicted by Hi-LayoutXLM is better. We can see that the layout can help locate the right answers.
From the above three cases, we can see that our approach can utilize layout and visual information to model multi-span answers much better than the baseline Hi-RoBERTaXLM.

Figure 1 :
Figure 1: The procedure comparison between open-domain QA and open-domain DocVQA.
(Example in Figure 2) Q: What should I do if the Bluetooth accessories cannot be connected to the iPhone? A: • 检查蓝牙配件是否兼容iOS设备 (Check whether the Bluetooth accessories are compatible with iOS devices); • 确认蓝牙配件是否开机 (Confirm whether the Bluetooth accessories are turned on); ...

Figure 2 :
Figure 2: An example in DuReader vis. Since the original document image is too large, we only show a part of it and indicate the answer by the red bounding box. The answer is a list in this example.

Figure 3 :
Figure 3: The stages of DocVQA. Taking a question and a document image, we utilize OCR to extract texts and layouts from the document image and split them into m paragraphs with an average length of 512. DocVQA contains three stages: 1) paragraph encoders using LayoutXLM to encode the input texts, layouts, and visual features, where the visual features are extracted by Mask-RCNN (all paragraph encoders share the same parameters); 2) a document encoder to further encode documents by combining all paragraph encodings in the document image together; 3) an answer extractor applying a CRF layer to label multi-span answers with the BIOES label format.

Figure 4 :
Figure 4: Length analysis on the development set of DuReader vis. #Samples: the number of samples.

Figure 5 :
Figure 5: A table-type answer example with multiple table cells as the answer. The red bounding box indicates the answer.

Table 1 :
The comparison between DuReader vis and existing DocVQA datasets.# denotes "the number of".


Table 3 :
Experimental results on the DocVQA task in DuReader vis (dev/test). "Text", "List", and "Table" are the answer types, and "All" denotes the whole set.

Table 4 :
Experimental results on the DocVRE task in DuReader vis (dev/test).

Table 5 :
Experimental results on the Open-domain DocVQA task in DuReader vis (dev/test).

Table 6 :
Comparisons between Open-domain QA and Open-domain DocVQA (dev/test). The metric for retrieval is MRR. The metric for the reader is Rouge-L.
Figure 6: A table-type answer example with a single table cell as the answer. The red bounding box indicates the answer.