Towards Visual Question Answering on Pathology Images

Pathology imaging is broadly used for identifying the causes and effects of diseases or injuries. Given a pathology image, being able to answer questions about the clinical findings contained in the image is very important for medical decision making. In this paper, we aim to develop a pathological visual question answering (VQA) framework to analyze pathology images and answer medical questions related to these images. To build such a framework, we create PathVQA, a VQA dataset with 32,795 questions generated from 4,998 pathology images. We also propose a three-level optimization framework which performs self-supervised pretraining and VQA finetuning end-to-end to jointly learn powerful visual and textual representations, and which automatically identifies and excludes noisy self-supervised examples from pretraining. We perform experiments on the PathVQA dataset, and the results demonstrate the effectiveness of our proposed methods. The dataset and code are available at https://github.com/UCSD-AI4H/PathVQA


Introduction
Pathology (Levison et al., 2012) studies the causes and effects of diseases or injuries. It underpins every aspect of patient care, from diagnostic testing and treatment advice to the prevention of diseases using cutting-edge genetic technologies. Given a pathology image, being able to answer questions about the clinical findings contained in the image is very important for medical decision making.
In this paper, we aim to develop a pathological visual question answering framework to analyze pathology images and answer medical questions related to these images. We first need to collect a dataset containing questions about pathology imaging. One possible way to create a pathology VQA dataset is crowdsourcing, which has been used successfully for creating general-domain VQA datasets (Malinowski and Fritz, 2014; Antol et al., 2015; Ren et al., 2015a; Johnson et al., 2017; Goyal et al., 2017). However, it is much more challenging to build medical VQA datasets than general-domain VQA datasets via crowdsourcing. First, medical images such as pathology images are highly domain-specific and can only be interpreted by well-educated medical professionals; it is difficult and expensive to hire medical professionals to help create medical VQA datasets. Second, to create a VQA dataset, one first needs to collect an image dataset. While images in the general domain are pervasive, medical images are very difficult to obtain due to privacy concerns.
To address these challenges, we resort to pathology textbooks, especially those that are freely accessible online, as well as online digital libraries. We extract images and captions from the textbooks and digital libraries. Given these images, question-answer pairs are created based on the image captions. These QA pairs are verified by medical professionals to ensure clinical meaningfulness and correctness. In the end, we created a pathology VQA dataset called PathVQA, which contains 32,795 questions about 4,998 pathology images. To the best of our knowledge, this is the first dataset for pathology VQA.
Given the pathology VQA dataset, the next step is to develop a pathology VQA system, which is also very challenging, for the following reason. The medical concepts involved in PathVQA are very diverse, while the number of question-answer pairs available for training is limited. Learning effective representations of these diverse medical concepts from limited data is technically difficult, and poorly learned representations lead to inferior VQA performance.

Figure 1: Two exemplar images with generated questions. Both images have three types of questions: "what", "where", and "yes/no".

To address this challenge, we propose a three-level optimization framework which performs cross-modal self-supervised pretraining (Tan and Bansal, 2019) and VQA finetuning of a pathology image encoder and a question encoder end-to-end, so that powerful visual and textual representations are learned jointly, and which automatically identifies and excludes noisy self-supervised examples from pretraining. Experiments on our PathVQA dataset demonstrate the effectiveness of the proposed methods. The major contributions of this paper are as follows:
• We create a pathology visual question answering dataset, PathVQA, to foster research on medical VQA. To the best of our knowledge, this is the first dataset for pathology VQA.
• We propose a three-level optimization framework which performs cross-modal self-supervised pretraining and VQA finetuning of a pathology image encoder and a question encoder end-to-end, to jointly learn powerful visual and textual representations, and which automatically identifies and excludes noisy self-supervised examples from pretraining.
• On our PathVQA dataset, we demonstrate the effectiveness of our proposed method.

Cross-modal Self-supervised Learning
Cross-modal self-supervised learning learns representations for data with multiple modalities by solving cross-modal auxiliary tasks. VisualBERT (Li et al., 2019) learns representations for images and texts by implicitly aligning elements of a text and regions in an associated image with self-attention. CVLP (Shi et al., 2020) proposes an unbiased contrastive visual-linguistic pretraining approach, which constructs a self-supervised loss based on contrastive learning. ViLBERT (Lu et al., 2019) pretrains a vision-and-language BERT model through masked multi-modal modeling and alignment tasks, and then transfers the model to visual question answering tasks.

The PathVQA Dataset
The 4,998 pathology images are collected from pathology textbooks, including "Basic Pathology" (Robbins et al., 1981), and from the PEIR digital library, which contributes 3,328 images. The question-answer pairs are generated from the image captions using a semi-automated pipeline with linguistic rules. Figure 1 shows some examples. On average, each image has 6.6 questions. The maximum and minimum numbers of questions for a single image are 14 and 1 respectively. The average number of words per question is 9.5 and per answer is 2.5. There are eight categories of questions: what, where, when, whose, how, why, how much/how many, and yes/no. Table 1 shows the number and percentage of questions in each category. The questions in the first seven categories are open-ended: 16,466 in total, accounting for 50.2% of all questions. The rest are close-ended "yes/no" questions. The questions cover various aspects of visual content, including color, location, appearance, shape, etc. Such clinical diversity poses great challenges for AI models to solve this pathology VQA problem.

Method
We propose a three-level optimization-based framework to perform VQA on PathVQA. The framework has three learning stages, which are performed end-to-end jointly. In the first stage, self-supervised learning (SSL) (He et al., 2019; Tan and Bansal, 2019) is performed to pretrain the image encoder and the text encoder. In the second stage, we finetune the image encoder and text encoder on the PathVQA dataset. In the third stage, the trained model is validated on the validation set. In the first stage, we perform cross-modal self-supervised learning (Tan and Bansal, 2019) of an image encoder W and a text encoder T. The image encoder extracts visual features of pathology images; the text encoder extracts semantic features of questions and answers. Self-supervised learning (He et al., 2019) is an unsupervised representation learning approach in which pretext tasks are defined solely based on the input data, and representations are learned by solving these pretext tasks.
There are many ways to construct pretext tasks. In our work, following (Tan and Bansal, 2019), we define a simple yet effective pretext task: given a pathology image and a question from the PathVQA dataset, judge whether the question is about the image. From the PathVQA training set D, we create an auxiliary dataset D' = {(x_i, y_i, t_i)}_{i=1}^M to perform the SSL task, where each of the M tuples contains a pathology image x_i from D and a question y_i from D, and t_i is a binary variable with t_i = 1 if x_i and y_i come from the same training example in D and t_i = 0 otherwise. Given D', we develop a model to map (x_i, y_i) to t_i. In this model, an image encoder encodes x_i and a text encoder encodes y_i; the concatenation of the two encodings is fed into a linear layer to predict whether the image matches the question.
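To make this concrete, below is a minimal PyTorch sketch of the matching model: two encoders produce feature vectors for the image and the question, their concatenation passes through a linear layer, and the two-way output is trained with cross-entropy. The encoder modules and all dimensions are placeholder assumptions, not the exact architectures used in our experiments.

import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    # Predicts whether a question y is about an image x (the SSL pretext task).
    def __init__(self, image_encoder, text_encoder, img_dim=768, txt_dim=768):
        super().__init__()
        self.image_encoder = image_encoder    # weights W in the notation above
        self.text_encoder = text_encoder      # weights T in the notation above
        self.classifier = nn.Linear(img_dim + txt_dim, 2)   # linear matching layer

    def forward(self, image, question_tokens):
        v = self.image_encoder(image)               # (batch, img_dim) visual features
        q = self.text_encoder(question_tokens)      # (batch, txt_dim) textual features
        return self.classifier(torch.cat([v, q], dim=-1))   # f(x, y; W, T)

# Per-example loss l(f(x, y; W, T), t); reduction="none" keeps one value per example
# so that the selection variables introduced below can re-weight individual examples.
ssl_criterion = nn.CrossEntropyLoss(reduction="none")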
In self-supervised learning (He et al., 2019), the labels are typically constructed automatically without human supervision. As a result, they contain a lot of noise. For example, in D', t is determined simply based on whether x and y come from the same training example in D. It is entirely possible that a question y asked about an image x is also an appropriate question for another image x' if x and x' are pathologically similar. In this case, the correct label t for (x', y) should be 1; however, it is set to 0 in D'. Training the encoders with such noisy and incorrect labels may confuse the encoders and result in poor-quality representations.
To address this problem, we aim to automatically identify incorrectly auto-labeled examples in the training data of the SSL task. For each example (x_i, y_i, t_i) in D', we associate a selection variable a_i ∈ [0, 1] with it: if a_i is close to 1, the example is deemed correctly labeled; if a_i is close to 0, it is deemed incorrectly labeled. Let l(f(x_i, y_i; W, T), t_i) denote the SSL loss defined on (x_i, y_i, t_i), where f(x_i, y_i; W, T) is the predicted probability that t_i = 1 and l(·) is the cross-entropy loss. We multiply a_i with l(f(x_i, y_i; W, T), t_i) so that if (x_i, y_i, t_i) is incorrectly labeled, its loss is down-weighted towards 0 and the example is effectively excluded from SSL pretraining. In the end, only correctly-labeled examples are used for pretraining the encoders. To this end, in the first stage we solve the following optimization problem:

W*(A), T*(A) = argmin_{W,T} Σ_{i=1}^M a_i l(f(x_i, y_i; W, T), t_i),

where A = {a_i}_{i=1}^M denotes the collection of selection variables; W*(A) and T*(A) are functions of A.

In the second stage, we finetune the image encoder and text encoder on the VQA task defined on the PathVQA dataset D. Let V, U, and R denote the network weights of the image encoder, text encoder, and QA network respectively. We train V, U, and R by minimizing the VQA loss L_vqa(D; V, U, R), where each training example in D consists of an input pathology image, an input question, and an output answer; the selection variables A are not updated in this stage. When training V and U, we encourage them to be close to the optimally trained weights W*(A) and T*(A) of the image and text encoders from the first stage, to transfer the representations learned in the SSL task to the VQA task. The second stage amounts to solving the following optimization problem:

V*(W*(A)), U*(T*(A)), R* = argmin_{V,U,R} L_vqa(D; V, U, R) + γ1 ||V − W*(A)||^2 + γ2 ||U − T*(A)||^2,    (1)

where the L2 losses encourage V and U to be close to W*(A) and T*(A), and γ1 and γ2 are tradeoff parameters.
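Continuing the sketch above, the two training losses can be written as follows in PyTorch. This is a minimal illustration rather than the exact implementation: vqa_loss_fn, num_ssl_examples, and the batch structure are placeholder assumptions, and the selection variables are stored as free logits squashed into [0, 1] with a sigmoid.

import torch
import torch.nn as nn
import torch.nn.functional as F

# One free logit per SSL example; a_i = sigmoid(logit_i) lies in [0, 1].
sel_logits = nn.Parameter(torch.zeros(num_ssl_examples))

def stage1_weighted_ssl_loss(matching_model, ssl_batch):
    # Selection-weighted SSL loss: sum_i a_i * l(f(x_i, y_i; W, T), t_i).
    images, questions, targets, idx = ssl_batch        # idx indexes examples in D'
    logits = matching_model(images, questions)         # f(x, y; W, T)
    per_example = F.cross_entropy(logits, targets, reduction="none")
    a = torch.sigmoid(sel_logits[idx])                 # selection variables a_i
    return (a * per_example).sum()

def stage2_finetune_loss(vqa_model, vqa_batch, W_star, T_star, gamma1=0.3, gamma2=0.7):
    # VQA loss plus L2 proximity to the pretrained encoder weights W*(A) and T*(A).
    vqa = vqa_loss_fn(vqa_model, vqa_batch)            # L_vqa(D; V, U, R)
    prox_v = sum((v - w).pow(2).sum()
                 for v, w in zip(vqa_model.image_encoder.parameters(), W_star))
    prox_u = sum((u - t).pow(2).sum()
                 for u, t in zip(vqa_model.text_encoder.parameters(), T_star))
    return vqa + gamma1 * prox_v + gamma2 * prox_u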
In the third stage, we apply the optimally trained VQA model including V * (W * (A)), U * (T * (A)), and R * to make predictions on the validation dataset.
Then we learn the selection variables A by minimizing the validation loss:

min_A L_vqa(D_val; V*(W*(A)), U*(T*(A)), R*).

Putting all these pieces together, we have the following three-level optimization framework:

min_A  L_vqa(D_val; V*(W*(A)), U*(T*(A)), R*)
s.t.   V*(W*(A)), U*(T*(A)), R* = argmin_{V,U,R} L_vqa(D; V, U, R) + γ1 ||V − W*(A)||^2 + γ2 ||U − T*(A)||^2
       W*(A), T*(A) = argmin_{W,T} Σ_{i=1}^M a_i l(f(x_i, y_i; W, T), t_i)
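Optimizing A requires differentiating the validation loss through the two lower-level problems. As a hedged sketch (the exact update rule is not spelled out above), a common approximation replaces each inner argmin with a single gradient step with learning rates η (first stage) and η' (second stage); the chain rule then gives, for each selection variable a_i,

∂L_vqa(D_val)/∂a_i ≈ (∂L_vqa(D_val)/∂V*) (∂V*/∂W*(A)) (∂W*(A)/∂a_i) + (∂L_vqa(D_val)/∂U*) (∂U*/∂T*(A)) (∂T*(A)/∂a_i),

where under the one-step approximation

∂W*(A)/∂a_i ≈ −η ∇_W l(f(x_i, y_i; W, T), t_i)   and   ∂V*/∂W*(A) ≈ 2 η' γ1 I,

and analogously for T*(A), U*, and γ2. Intuitively, SSL examples whose gradients help reduce the validation VQA loss receive larger a_i, while harmful (likely mislabeled) examples are pushed towards a_i = 0.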

VQA Models
Our proposed method can be applied to any VQA model. In this work, we choose two well-established and state-of-the-art VQA methods to perform the study, while noting that other VQA methods are applicable as well.
• Method 1 follows the cross-modal Transformer model of (Tan and Bansal, 2019), which encodes images and questions with Transformer layers and aligns the two modalities with cross-modal attention.
• Method 2 follows (Kim et al., 2018). It extends the idea of co-attention to bilinear attention, which considers every pair of multimodal channels (a simplified sketch is given below).
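As a simplified sketch of the bilinear-attention idea (an illustration only, not the full architecture of (Kim et al., 2018)), every pair of an image-region channel and a question-word channel can be scored in a shared joint space:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleBilinearAttention(nn.Module):
    # Toy bilinear attention: scores every (image region, question word) pair.
    def __init__(self, img_dim, txt_dim, joint_dim=512):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, joint_dim)
        self.proj_txt = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, R, img_dim); txt_feats: (batch, W, txt_dim)
        vi = self.proj_img(img_feats)                     # (batch, R, d)
        qt = self.proj_txt(txt_feats)                     # (batch, W, d)
        scores = torch.einsum("brd,bwd->brw", vi, qt)     # bilinear score per pair
        attn = F.softmax(scores.flatten(1), dim=-1).view_as(scores)
        # joint feature pooled over all region-word pairs
        joint = torch.einsum("brw,brd,bwd->bd", attn, vi, qt)
        return joint, attn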

Experimental Settings
Data split We partition the images in the PathVQA dataset, along with the associated questions, into a training set, validation set, and testing set with a ratio of about 3:1:1. In the PathVQA dataset, the frequencies of question categories are imbalanced. Because of this, during the partition we perform stratified sampling to keep the frequencies of these categories consistent across the three sets. In the end, there are 19,755 question-answer pairs in the training set, 6,279 in the validation set, and 6,761 in the testing set.

Evaluation metrics The metrics we report include BLEU, which measures the similarity of predicted answers and ground-truth answers by matching n-grams.

Results

Table 2 shows the VQA performance achieved by different methods. From this table, we make the following observations. First, for both Method 1 and Method 2, applying our three-level optimization based framework improves performance. Our framework learns to identify and remove noisy and erroneous SSL training examples, which prevents the model from being distorted by such bad-quality examples. Second, for both Method 1 and Method 2, applying cross-modal SSL (CMSSL) methods, including CMSSL-IQ and CMSSL-IA, improves performance, which demonstrates the effectiveness of CMSSL. CMSSL uses auxiliary tasks, namely judging whether an image matches a question and judging whether an image matches an answer, to learn the semantic correspondence between image regions and words in questions/answers, which improves the effectiveness of visual and textual representations for accurate VQA. Encouraging the image and text encoders to solve these auxiliary tasks also reduces the risk of overfitting to the data-deficient VQA task on the small-sized training data.
One may wonder how much information in the images is actually used during inference of the answers. Could it be that the models simply learn the correlations between questions and answers and ignore the images? In light of these concerns, we perform studies where the images are not fed into the VQA models and only the questions are used as inputs for inferring answers. Table 2 shows the results of not using images ("Method 1/2 without image"). As can be seen, for both Method 1 and Method 2, ignoring images leads to a substantial degradation of performance. This shows that the images in our dataset provide valuable information for VQA and that PathVQA is a meaningful VQA dataset. The models trained on our dataset do not degenerate into simply capturing the correlation between questions and answers.

Conclusion
In this paper, we build a pathology VQA dataset, PathVQA, that contains 32,795 question-answer pairs of 8 categories, generated from 4,998 images. The majority of questions in our dataset are open-ended, posing great challenges for medical VQA research. Our dataset is publicly available. To address the challenges that the self-supervised training data may contain errors and that effective representations of pathology images and questions are difficult to learn from limited data, we propose a three-level optimization framework to automatically identify and remove problematic SSL training examples and to learn sample-efficient visual and textual representations. Experiments on the PathVQA dataset demonstrate the effectiveness of our method.

A Implementation Details

Table 3 shows dataset split statistics. We implement the methods using PyTorch and perform training on four GTX 1080Ti GPUs. We basically follow the original model configurations used in (Tan and Bansal, 2019), (Kim et al., 2018), and (Yang et al., 2016). Data augmentation is applied to images, including shifting, scaling, and shearing. From the questions and answers in the PathVQA dataset, we create a vocabulary of the 4,631 words that have the highest frequencies.

In Method 1, we use the default hyperparameter settings in (Tan and Bansal, 2019). For the text encoder, the hidden size was set to 768. The image features were extracted from the outputs of a Faster-RCNN network pretrained on BCCD (https://public.roboflow.ai/object-detection/bccd), a medical dataset containing blood cell photos, as well as on Visual Genome (Krishna et al., 2017). The initial learning rate was set to 5e-5 and the Adam optimizer (Kingma and Ba, 2014a) was used. The batch size was set to 256. The model was trained for 200 epochs. In the SSL pretraining task for Method 1, we train a linear classifier with a dimension of 1,280 to judge whether an image matches a question. In Method 2, words in questions and answers are represented using GloVe (Pennington et al., 2014) vectors pretrained on general-domain corpora such as Wikipedia and Twitter. The image features are extracted from the outputs of the Faster-RCNN network pretrained on BCCD and Visual Genome. Given an image and a question, the model outputs an answer from a predefined set of answers. The dropout (Srivastava et al., 2014) rate for the linear mapping was set to 0.2 and for the classifier to 0.5. The initial learning rate was set to 0.005 and the Adamax optimizer (Kingma and Ba, 2014b) was used. The batch size was set to 512. The model was trained for 200 epochs. In the SSL pretraining task for Method 2, similar to Method 1, we train a linear classifier with a dimension of 1,280 to predict whether an image matches a question. We optimize the selection variables using the Adam optimizer with an initial learning rate of 0.01. We set γ1 and γ2 to 0.3 and 0.7 respectively.
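As a rough illustration of how these settings translate into code (the model variables are placeholders and the parameter grouping is an assumption), the three optimizers could be constructed as:

import torch

# Method 1 (Adam, lr 5e-5, batch size 256, 200 epochs)
opt_method1 = torch.optim.Adam(method1_model.parameters(), lr=5e-5)

# Method 2 (Adamax, lr 0.005, batch size 512, 200 epochs)
opt_method2 = torch.optim.Adamax(method2_model.parameters(), lr=0.005)

# Selection variables A (Adam, lr 0.01); gamma1 = 0.3 and gamma2 = 0.7 in the stage-2 loss
opt_selection = torch.optim.Adam([sel_logits], lr=0.01)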

B Dataset Creation
We develop a semi-automated pipeline to generate a pathology VQA dataset from pathology textbooks and online digital libraries, and we manually check the automatically-generated question-answer pairs to fix grammar errors. The automated pipeline consists of two steps: (1) extracting pathology images and their captions from electronic pathology textbooks and the Pathology Education Informational Resource (PEIR) Digital Library website; (2) generating question-answer pairs from the captions.

B.1 Extracting Pathology Images and Captions
Given a pathology textbook that is in PDF format and publicly available online, we use two third-party tools, PyPDF2 and PDFMiner, to extract images and the associated captions. PyPDF2 provides APIs to access the "Resources" object in each PDF page, whose "XObject" entry gives information about images. PDFMiner allows one to obtain text along with its exact location on a page. To extract image captions from the text of each page, we use regular expressions to search for snippets with prefixes of "Fig." or "Figure" followed by figure numbers and caption text. For a page containing multiple images, we order the images based on their locations, and do the same for the captions; images and captions are then matched based on their order.
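A minimal sketch of the caption-matching step is given below; it assumes pdfminer.six for text extraction and uses a simplified regular expression, so it illustrates the idea rather than reproducing the exact pipeline.

import re
from pdfminer.high_level import extract_text

# Match snippets like "Fig. 12.3 caption text" or "Figure 4: caption text"
CAPTION_RE = re.compile(r"(?:Fig\.|Figure)\s*([\d.]+)[:.]?\s+([^\n]+)")

def extract_captions(pdf_path, page_number):
    # Return (figure_number, caption_text) pairs found on one PDF page.
    text = extract_text(pdf_path, page_numbers=[page_number])
    return CAPTION_RE.findall(text)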
Given an online pathology digital library such as PEIR, we use two third-party tools, Requests and Beautiful Soup, to crawl images and the associated captions. Requests is an HTTP library for Python and provides APIs to send HTTP/1.1 requests. Beautiful Soup parses the returned HTML (e.g., with the built-in 'html.parser') and gives access to the URLs and tags of the images on the website pages. Given a set of URLs, we use Requests to read the website pages and Beautiful Soup to find images under the targeted HTML tags, including the content division element div, the unordered list element ul, and the list item element li. We then download the images with Requests and write their captions directly to local files. Given the extracted image-caption pairs, we perform post-processing, including (1) removing images that are not pathology images, such as flow charts and portraits, and (2) correcting erroneous matchings between images and captions.
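Below is a simplified sketch of this crawling step; the URL layout, CSS selectors, and output file naming are hypothetical placeholders, and the real PEIR pages may use different markup.

import os
import requests
from bs4 import BeautifulSoup

def crawl_images(page_url, out_dir="peir_images"):
    # Download images and their caption text from one listing page.
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # look for images inside div / ul / li containers, as described above
    for idx, li in enumerate(soup.select("div ul li")):
        img = li.find("img")
        if img is None or not img.get("src"):
            continue
        img_url = requests.compat.urljoin(page_url, img["src"])
        caption = li.get_text(strip=True)              # caption text next to the image
        with open(os.path.join(out_dir, f"{idx}.jpg"), "wb") as f:
            f.write(requests.get(img_url, timeout=30).content)
        with open(os.path.join(out_dir, f"{idx}.txt"), "w") as f:
            f.write(caption)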

B.2 Question Generation
In this section, we discuss how to semi-automatically generate questions from captions. Figure 2 shows the overall framework. We perform natural language processing of the captions using the Stanford CoreNLP toolkit (Klein and Manning, 2003), including sentence splitting, tokenization, part-of-speech (POS) tagging, named entity recognition (NER), constituent parsing, and dependency parsing. Many sentences are long, with complicated syntactic structures, so we perform sentence simplification to break a long sentence into several short ones. Given the subjects, verbs, clauses, etc. labeled by POS tagging and syntactic parsing, we rearrange them using the rules proposed in (Toutanova et al., 2007; Dorr et al., 2003) to achieve simplification. Figure 3 shows an example.
Figure 3: Sentence simplification. The long complex sentence "Microscopy shows coagulative necrosis of the affected bowel wall and thrombosed vessels while the junction with normal intestine is indistinct and shows an inflammatory infiltrate." is rearranged into the simple short sentences "Microscopy shows coagulative necrosis of the affected bowel wall and thrombosed vessels.", "The junction with normal intestine is indistinct.", and "The junction shows an indistinct inflammatory infiltrate."

Given the POS tags and named entities of the simplified sentences, we generate questions for them: "when" questions for date and time entities and for phrases such as "in/during ... stage/period", "before ...", and "after ..."; "how much/how many" questions for words tagged as numbers; "whose" questions for possessive pronouns (e.g., "its", "their"); "where" questions for location entities and prepositional phrases starting with "inner", "within", and "on the right/left of"; "how" questions for adjective words and phrases starting with "using", "via", "with", and "through"; and "what" questions for the remaining noun phrases. Table 5 shows an example for each type of question.
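As a toy illustration of these rules (the tag names are simplified assumptions; the actual pipeline relies on Stanford CoreNLP annotations), a candidate answer phrase can be mapped to a question type as follows:

def choose_question_type(phrase, pos_tag, ner_tag):
    # Map a candidate answer phrase to a question type, following the rules above.
    text = phrase.lower()
    if ner_tag in {"DATE", "TIME"} or text.startswith(("in ", "during ", "before ", "after ")):
        return "when"
    if pos_tag == "CD" or ner_tag == "NUMBER":                 # numbers
        return "how much/how many"
    if pos_tag == "PRP$":                                      # possessive pronouns: its, their
        return "whose"
    if ner_tag == "LOCATION" or text.startswith(("inner", "within", "on the right", "on the left")):
        return "where"
    if pos_tag == "JJ" or text.startswith(("using", "via", "with", "through")):
        return "how"
    return "what"                                              # remaining noun phrases

# Example: choose_question_type("two", "CD", "NUMBER") returns "how much/how many".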
We use Tregex from the Stanford CoreNLP tools (Manning et al., 2014), a tree query language with various relational operators based on the primitive relations of immediate dominance and immediate precedence, to implement the rules of (Heilman and Smith, 2009) for transforming declarative sentences (captions) into questions.
To reduce grammatical errors, we avoid generating questions from sentences with adverbial clauses, such as "chronic inflammation in the lung, showing all three characteristic histologic features". The question transducer mainly contains three steps. First, we perform main-verb decomposition based on the tense of the verb. For instance, we decompose "shows" into "does show". It is worth noting that for passive sentences with a structure of "be + shown/presented/demonstrated", we keep the original forms rather than performing the verb decomposition. Second, we perform subject-auxiliary inversion: we invert the subject and the auxiliary verb in the declarative sentence to form an interrogative sentence. After the inversion, binary "yes/no" questions are generated. For instance, as shown in Figure 4, the sentence "microscopy shows coagulative necrosis of the affected bowel wall and thrombosed vessels" is inverted to "does microscopy show coagulative necrosis of the affected bowel wall and thrombosed vessels?". To generate questions whose answers are "no", we randomly select a phrase with the same POS tag from other captions to replace the head words in the original question; for example, we replace "coagulative necrosis" in the question above with other noun phrases. Third, we remove the target answer phrase and insert the question phrase obtained previously to generate open-ended questions of the types "what", "where", "when", "whose", "how", and "how much/how many", as shown in Table 5. For instance, we transduce "microscopy shows coagulative necrosis of the affected bowel wall and thrombosed vessels" into "what of the affected bowel wall and thrombosed vessels does microscopy show?", as shown in Figure 4. Given the automatically generated questions, which may contain syntactic and semantic errors, we perform post-processing to fix these issues. We manually proofread all questions to correct misspellings, syntactic errors, and semantic inconsistencies. The questions and answers are further cleaned by removing extra spaces and irrelevant symbols. Questions that are too short or vague are removed. Articles appearing at the beginning of answers are stripped.

Table 5: An example original sentence and generated question for each question type.
What. Original: The end of the long bone is expanded in the region of epiphysis. Question: What is expanded in the region of epiphysis?
Where. Original: The left ventricle is on the lower right in this apical four-chamber view of the heart. Question: Where is the left ventricle in this apical four-chamber view of the heart?
When. Original: After 1 year of abstinence, most scars are gone. Question: When are most scars gone?
How much/How many. Original: Two multi-faceted gallstones are present in the lumen. Question: How many multi-faceted gallstones are present in the lumen?
Whose. Original: The tumor cells and their nuclei are fairly uniform, giving a monotonous appearance. Question: The tumor cells and whose nuclei are fairly uniform, giving a monotonous appearance?
How. Original: The trabecular bone forming the marrow space shows trabeculae with osteoclastic activity at the margins. Question: How does the trabecular bone forming the marrow space show trabeculae?
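A toy sketch of the yes/no transduction step is given below; it handles only a present-tense third-person-singular verb such as "shows" (an assumption made for illustration), whereas the real pipeline relies on CoreNLP parses and Tregex rules.

def to_yes_no_question(sentence, verb="shows"):
    # Transduce a simple declarative caption into a binary yes/no question.
    subject, _, rest = sentence.rstrip(".").partition(f" {verb} ")
    if not rest:                          # the target verb was not found
        return None
    # main verb decomposition ("shows" -> "does show") + subject-auxiliary inversion
    return f"Does {subject[0].lower() + subject[1:]} show {rest}?"

# Example:
# to_yes_no_question("Microscopy shows coagulative necrosis of the affected bowel wall "
#                    "and thrombosed vessels.")
# returns "Does microscopy show coagulative necrosis of the affected bowel wall and thrombosed vessels?"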