Entity Tracking via Effective Use of Multi-Task Learning Model and Mention-guided Decoding

Cross-task knowledge transfer via multi-task learning has recently made remarkable progress in general NLP tasks. However, entity tracking on procedural text has not benefited from such knowledge transfer because of its distinct formulation, i.e., tracking the event flow while following structural constraints. State-of-the-art entity tracking approaches either design complicated model architectures or rely on task-specific pre-training to achieve good results. In this work, we propose MeeT, a Multi-task learning-enabled entity Tracking approach, which utilizes knowledge gained from general domain tasks to improve entity tracking. Specifically, MeeT first fine-tunes T5, a pre-trained multi-task learning model, with entity tracking-specialized QA formats, and then employs our customized decoding strategy to satisfy the structural constraints. MeeT achieves state-of-the-art performance on two popular entity tracking datasets, even though it does not require any task-specific architecture design or pre-training.


Introduction
Pre-trained language models have revolutionized the NLP field in recent years (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) and have also become more versatile with the encoder-decoder architecture (Raffel et al., 2020; Lewis et al., 2020), which allows them to handle different types of NLP tasks without further architectural changes. This versatility inherently facilitates cross-task knowledge transfer via multi-task learning (Raffel et al., 2020; Aribandi et al., 2022), and thus helps push the boundary of many popular NLP tasks such as question answering (Khashabi et al., 2020) and semantic parsing (Xie et al., 2022).

Figure 1: Overview of MEET (Multi-task learning-enabled entity Tracking). MEET utilizes the multi-task learning in T5 to boost entity tracking performance, with a customized decoding strategy addressing the structural constraints in state prediction (e.g., "move" cannot happen after "destroy").

However, entity tracking, which tracks the states and locations of an entity throughout procedural text, like scientific processes or recipes, has not been impacted by this multi-task learning wave for two main reasons. First, entity tracking requires the model to make step-wise predictions while satisfying structural constraints (e.g., an entity cannot be "moved" after being "destroyed" in a previous step). This requirement is usually tackled by designing task-specific architectures (Gupta and Durrett, 2019b), which generic multi-task models with the encoder-decoder architecture cannot address easily. Second, understanding procedural text requires domain-specific knowledge, which usually does not exist in the general domain tasks that multi-task learning models are trained on, so it is not clear how effective the knowledge transfer will be given this domain gap. In this paper, we study how entity tracking can benefit from the current multi-task learning paradigm and present MEET, a Multi-task learning-enabled entity Tracking approach. This approach includes two parts. The first part fine-tunes T5 (Raffel et al., 2020), a model that has been pre-trained on a diverse set of NLP tasks and has shown great cross-task generalizability. Here, we design entity tracking-specialized QA formats to accommodate the need to make step-specific predictions, while facilitating effective knowledge transfer from T5. The second part resolves conflicting state predictions under structural constraints. We use a customized offline CRF inference algorithm, whose main idea is to emphasize the predictions of steps in which the query entity is explicitly mentioned, because the fine-tuned model performs better in those cases (Table 5). On two benchmark datasets, ProPara (Dalvi et al., 2018) and Recipes, MEET outperforms previous state-of-the-art methods, which require extra domain-specific pre-training or data augmentation. We verify the importance of multi-task learning in T5 and our proposed decoding strategy through careful analyses and ablation studies.
To sum up, our contributions are three-fold: (1) Our work is the first to explore cross-task knowledge transfer for entity tracking on procedural text; (2) Our proposed approach, MEET, effectively uses the off-the-shelf pre-trained multi-task learning model T5 with a customized decoding strategy, and thus achieves state-of-the-art performance on two benchmark datasets; (3) Our comprehensive analyses verify the benefits of multi-task learning on entity tracking.

Related Work
Tracking the progression of an entity within procedural text, such as cooking recipes or scientific protocols (Tamari et al., 2021; Le et al., 2022), is challenging as it calls for a model to understand both the superficial and intrinsic dynamics of the process. Recent work on entity tracking can be divided into two lines. One focuses on designing task-specific fine-tuning architectures to ensure that the model makes step-grounded predictions while following the structural constraints. For instance, Rajaby Faghihi and Kordjamshidi (2021) introduce time-stamp embeddings into RoBERTa (Liu et al., 2019) to encode the index of the query step. Gupta and Durrett (2019b) frame entity tracking as a structured prediction problem and use a CRF layer to promote global consistency under those structural constraints. In our case, we show that, with the QA formulation, simply appending the index of the query step to the question and indexing the procedure produces step-specific predictions. Moreover, we propose a customized offline CRF-decoding strategy for the structural constraints, to compensate for the fact that it is hard to jointly train T5, our backbone LM, with a CRF layer as in previous methods.
The other line of work focuses on domain-specific knowledge transfer. Concretely, LEMON achieves great performance by performing in-domain pre-training on 1 million procedural paragraphs. CGLI shows that adding high-quality pseudo-labeled data (generated via self-training) during fine-tuning can also boost model performance. In contrast, our work explores how entity tracking can benefit from out-of-domain knowledge by using off-the-shelf pre-trained multi-task learning models.

Method
In this section, we present MEET, a Multi-task learning-enabled entity Tracking approach. Here, we first review the problem definition, and then lay out the details of MEET.

Problem Definition
Entity tracking aims at monitoring the status of an entity throughout a procedure. The input of this task contains two items: 1) a procedural paragraph P, composed of a sequence of sentences {s_1, s_2, ..., s_T}; and 2) a procedure-specific query entity e. Given the input, our goal is to predict the state and location of the query entity at each timestamp of the procedure (see an example from the ProPara dataset in Figure 1).
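As a toy illustration of the task's input/output interface (the procedure, entity, and labels below are invented for illustration, loosely following ProPara-style state labels):

```python
# Hypothetical entity-tracking instance; all names and labels are illustrative.
procedure = [
    "Water evaporates from the ocean.",   # s_1
    "The vapor rises and cools.",         # s_2
    "The vapor condenses into clouds.",   # s_3
]
entity = "vapor"

# Expected output: one (state, location) pair per step s_t.
predictions = [
    ("create", "air"),    # vapor comes into existence at step 1
    ("move", "sky"),      # it moves upward at step 2
    ("destroy", "-"),     # it ceases to exist at step 3
]

# The prediction sequence must align with the procedure, one entry per step.
assert len(predictions) == len(procedure)
```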

MEET
MEET includes two parts: task-specific fine-tuning with our proposed QA formats, and mention-guided conflict-resolving decoding.
Task-specific Fine-tuning We formulate the two sub-tasks of entity tracking, state prediction and location prediction, as multiple-choice and extractive QA problems respectively (see §4.2 for a comparison with other task formulations), and fine-tune T5 to make independent predictions for every step in the procedure. Given a query entity e and procedure P, to predict the entity state at step t, the input sequence is formatted as the concatenation of the template question "What is the state of e in step t?", the candidate states (e.g., create, move, and destroy), and the full procedure with step indices prepended. The output is just one of the candidate states. For location prediction, the input sequence is the concatenation of the question "Where is e located in step t?" and the indexed procedure, with the snippet "Other locations: none, unknown." appended. This is because entity locations sometimes are not explicitly mentioned in the procedure. The output is a text span, indicating the location of the query entity after step t. Examples of both tasks can be found in Appendix A.
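A minimal sketch of how such QA inputs could be assembled; the exact templates follow the description above, but the candidate-state inventory and formatting details (separators, step prefixes) are assumptions for illustration:

```python
STATE_CANDIDATES = ("create", "exist", "move", "destroy", "none")  # assumed inventory

def indexed_procedure(procedure):
    """Prepend a step index to each sentence of the procedure."""
    return " ".join(f"step {i}: {s}" for i, s in enumerate(procedure, 1))

def state_input(entity, step, procedure):
    """Multiple-choice QA input for state prediction; the target output
    is one of the candidate states."""
    options = ", ".join(STATE_CANDIDATES)
    return (f"What is the state of {entity} in step {step}? "
            f"Candidate states: {options}. {indexed_procedure(procedure)}")

def location_input(entity, step, procedure):
    """Extractive QA input for location prediction; the target output is a
    text span, with 'none'/'unknown' offered for unmentioned locations."""
    return (f"Where is {entity} located in step {step}? "
            f"{indexed_procedure(procedure)} Other locations: none, unknown.")

procedure = ["Water evaporates.", "The vapor rises."]
print(state_input("vapor", 2, procedure))
print(location_input("vapor", 2, procedure))
```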
Conflict-resolving Decoding Entity tracking places unique structural constraints on state predictions (e.g., move cannot happen after destroy). Similar to Gupta and Durrett (2019a), we run an offline CRF-decoding method (Viterbi decoding) to resolve conflicting state predictions. We initialize the CRF transition scores T with the transition statistics in the training data. For example, T(p, q), the transition score between states p and q, is log(1/10) if there is only one p ⇒ q transition out of 10 transitions starting with state p. We set the scores of all unseen transitions to −inf. As for the CRF emission scores, we use the state prediction logits from T5. In contrast with previous methods, which treat each step equally, we weigh the emission scores differently, depending on whether the query entity e is explicitly mentioned in the step:

U′_i = τ_exp · U_i if e is explicitly mentioned in s_i, and U′_i = τ_imp · U_i otherwise,

where U′_i represents the emission score of step i after weighing, and τ_exp and τ_imp are hyperparameters determined by grid search on the dev set. The intuition behind our approach is that, as the fine-tuned model performs better on "explicitly mentioned" steps (Table 5), emphasizing those steps yields more reliable global decoding.

Experimental Setup

Datasets ProPara (Dalvi et al., 2018) contains 488 scientific process-based procedural paragraphs (Figure 1), and Recipes includes 866 cooking recipes. Note that previous work experiments with different splits of the Recipes dataset; in this paper, we follow the split used in most of the recent work. More dataset details are presented in Appendix B.
Evaluation ProPara performance is evaluated at two levels: sentence-level (Dalvi et al., 2018) and document-level (Tandon et al., 2018). Here, we focus on the document-level evaluation because it provides a comprehensive assessment of the model's understanding of the overall procedure and serves as the basis for the ProPara leaderboard rankings. The document-level evaluation is conducted by comparing the input/output entities and their transformations in the procedure with the gold answers. Further details regarding the two evaluations and the results of the sentence-level evaluation can be found in Appendix C. For Recipes, following previous work, we evaluate the location changes of each ingredient throughout the recipe.

Baselines For ProPara, we compare MEET with the top five approaches on its leaderboard. Among these five approaches, DYNAPRO, TSLM (Rajaby Faghihi and Kordjamshidi, 2021), and CGLI design task-specific fine-tuning architectures using off-the-shelf LMs, while KOALA and LEMON develop in-domain LMs for procedural text. For Recipes, as mentioned previously, we compare MEET with methods that experiment on the same data split. We refer readers to the corresponding paper of each baseline for further details.
Implementation Details Our approach MEET is implemented using Huggingface Transformers (Wolf et al., 2020). Given the limited computational resources, we choose T5-large as the backbone of MEET. The fine-tuning process employs the AdamW optimizer with a learning rate of 1 × 10^-4 and a batch size of 16. To resolve any potential conflict between state prediction and location prediction, we apply rules designed in previous work to integrate the output from both tasks.
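The mention-guided Viterbi decoding described in the Method section can be sketched as follows. This is an illustrative implementation, not the paper's code: the function names are invented, the emission/transition scores are toy values, and the τ weights are placeholders for the values found by grid search.

```python
import math

def weight_emissions(logits, mentioned, tau_exp=1.0, tau_imp=0.5):
    """Mention-guided weighting: scale each step's state scores by tau_exp
    when the query entity is explicitly mentioned in that step, tau_imp
    otherwise. tau values here are placeholders, not tuned ones."""
    return [{s: (tau_exp if m else tau_imp) * v for s, v in step.items()}
            for step, m in zip(logits, mentioned)]

def viterbi(emissions, transitions, states):
    """Offline Viterbi decoding in log-space.
    emissions[i][s]: (weighted) score of state s at step i;
    transitions[(p, q)]: log transition score; unseen transitions get -inf,
    which enforces the structural constraints."""
    best = [dict(emissions[0])]
    back = []
    for i in range(1, len(emissions)):
        best.append({})
        back.append({})
        for q in states:
            prev, score = max(
                ((p, best[i - 1][p] + transitions.get((p, q), -math.inf))
                 for p in states),
                key=lambda x: x[1],
            )
            best[i][q] = score + emissions[i][q]
            back[i - 1][q] = prev
    # Trace back the highest-scoring state sequence.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(back) - 1, -1, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Because unseen transitions score −inf, a forbidden sequence such as destroy ⇒ move can never be part of the decoded path, even if the per-step logits prefer it.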

Results
We present the test set results on ProPara and Recipes in Table 1 and Table 2, respectively.

Analysis & Ablation Study
Multi-task Learning To investigate the impact of T5's multi-task learning process on entity tracking, we experiment with two variants of T5 as the backbone of MEET: 1) T5-v1.1 (https://huggingface.co/docs/transformers/model_doc/t5v1.1), a T5-like LM (with slight architecture changes) whose pre-training does not include any supervised tasks; 2) T5-v1.1 QA-FT, the LM resulting from fine-tuning T5-v1.1 on three of the QA datasets that T5 is pre-trained on, including MultiRC (Khashabi et al., 2018) and ReCoRD (Zhang et al., 2018). The performance of the three LMs (T5-large size) on the ProPara dev set is presented in the top section of Table 3. We can see that T5 outperforms T5-v1.1 by a large margin, verifying that multi-task learning on out-of-domain, non-entity-tracking tasks can benefit entity tracking. In addition, the advantage of T5 over T5-v1.1 QA-FT indicates that knowledge transfer can cross task boundaries with T5's encoder-decoder architecture.

Task Formulation We compare our QA formulation with two other task formulations, proposed in recent work, for T5. The first formulation is called "step-input" (Gupta and Durrett, 2019a), where each pair of the query entity e and a procedure step t is formulated as one instance. Here, state prediction is formulated as a classification problem, where the entity name is appended to the input and no candidate answers are provided. Moreover, the procedure is trimmed up to step t to specify the step index in the input. The second formulation is called "process-input" (Gupta and Durrett, 2019b), where the model predicts entity states or locations in all steps in one instance. The input is the concatenation of entity e and the full procedure, and the model decodes entity states and locations in all steps sequentially. The results of the two formulations are presented in the middle of Table 3. Our proposed QA formulation outperforms the other two by a large margin. Detailed analyses of the formulation comparison can be found in Appendix D.
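The two alternative formulations can be contrasted with a small sketch; the templates below are hypothetical illustrations of the input shapes described above, not the exact prompts used in the cited work:

```python
# Illustrative input construction for the two alternative formulations.
procedure = ["Water evaporates.", "The vapor rises.", "The vapor condenses."]

def step_input(entity, t):
    """'step-input': one instance per (entity, step); the procedure is
    trimmed up to step t, so the step index is implicit in the context."""
    return f"entity: {entity}. " + " ".join(procedure[:t])

def process_input(entity):
    """'process-input': one instance per entity over the full procedure;
    the model decodes states/locations for all steps sequentially."""
    return f"entity: {entity}. " + " ".join(procedure)

print(step_input("vapor", 2))   # context stops after step 2
print(process_input("vapor"))   # full procedure, multi-step decoding target
```

The trimming in `step_input` is what makes later steps invisible to the model, while `process_input` exposes the full context but couples all step predictions into one autoregressive output, which is where the error propagation discussed in Appendix D arises.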

Decoding Strategy & Model Size
The ablation study on the decoding strategy and model size is shown in the bottom section of Table 3. Clearly, our proposed mention-guided decoding strategy, as well as using a larger LM as the backbone, contributes to the success of MEET.

Conclusion
We presented MEET, a T5-based entity tracking approach. The approach combines our newly proposed QA fine-tuning formats with a customized decoding strategy, so that it can effectively encode the flow of events in procedural text while following structural constraints. State-of-the-art performance on two benchmark datasets demonstrates the effectiveness of MEET, and further analyses verify that multi-task learning on out-of-domain tasks can be beneficial for entity tracking.

Limitations
This paper demonstrates that multi-task learning on a combination of general domain datasets can effectively improve a model's understanding of procedural text. However, the precise source dataset responsible for this improvement remains uncertain; identifying the most pertinent source dataset to enable more efficient knowledge transfer is an avenue for future research. Moreover, the pipeline structure of MEET may limit its practical utilization. As such, future work could consider incorporating our proposed mention-guided decoding strategy into the end-to-end training of the multi-task learning model.

B Datasets

In the Recipes dataset, each ingredient has two possible states (Exist or Absence) in each step of the recipe. Full data statistics on the two datasets are presented in Table 4.

C Evaluation
Sentence-level evaluation This evaluation measures the following questions for each target entity:
• Cat-1: Is the entity created (destroyed, moved) in the process?
• Cat-2: When is the entity created (destroyed, moved)?
• Cat-3: Where is the entity created (destroyed, moved from/to)?
Further, the F 1 scores of the three questions are aggregated with micro/macro averages.
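The micro/macro aggregation can be sketched generically; this is a standard formulation of the two averages from per-category counts, not the official evaluation script:

```python
def micro_macro_f1(per_category):
    """Aggregate per-category (tp, fp, fn) counts into micro and macro F1.
    Macro: average of per-category F1 scores (each category weighted equally).
    Micro: F1 computed from the pooled counts (each instance weighted equally)."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    macro = sum(f1(*c) for c in per_category) / len(per_category)
    tp, fp, fn = (sum(col) for col in zip(*per_category))
    micro = f1(tp, fp, fn)
    return micro, macro
```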

Document-level evaluation
It measures the four questions below for each paragraph:
• What are the input entities to the process?
• What are the output entities of the process?
• What entity conversions occur, when, and where?
• What entity movements occur, when, and where?
The macro average of the F1 scores of these four questions is used as the final score. Table 6 provides a comprehensive comparison of past work on the ProPara dataset, including both document-level and sentence-level evaluations.

D Analysis of Formulation Comparison
When compared with the "step-input" formulation, the QA formulation allows the model to see the full context, and may take better advantage of the LM's pre-training scheme (Li et al., 2019; Nagata et al., 2020). The "process-input" formulation performs worst in this comparison. Through qualitative analyses, we find that it suffers from error propagation due to its autoregressive decoding, so future work may explore incorporating structured decoding (Tandon et al., 2018) into T5.