OpenPI-C: A Better Benchmark and Stronger Baseline for Open-Vocabulary State Tracking

Open-vocabulary state tracking is a more practical version of state tracking that aims to track state changes of entities throughout a process without restricting the state space and entity space. OpenPI is to date the only dataset annotated for open-vocabulary state tracking. However, we identify issues with the dataset quality and evaluation metric. For the dataset, we categorize 3 types of problems on the procedure level, step level and state change level respectively, and build a clean dataset OpenPI-C using multiple rounds of human judgment. For the evaluation metric, we propose a cluster-based metric to fix the original metric's preference for repetition. Model-wise, we enhance the seq2seq generation baseline by reinstating two key properties for state tracking: temporal dependency and entity awareness. The state of the world after an action is inherently dependent on the previous state. We model this dependency through a dynamic memory bank and allow the model to attend to the memory slots during decoding. On the other hand, the state of the world is naturally a union of the states of involved entities. Since the entities are unknown in the open-vocabulary setting, we propose a two-stage model that refines the state change prediction conditioned on entities predicted from the first stage. Empirical results show the effectiveness of our proposed model especially on the cluster-based metric. The code and data are released at https://github.com/shirley-wu/openpi-c


Introduction
State tracking is the task of predicting the states of the world after an action is performed. Most existing work operates under a simplified closed-vocabulary setting, assuming the state space and involved entities are known (Dalvi et al., 2018; Bosselut et al., 2018), which limits its applicability. The more practical open-vocabulary setting assumes both the entities and the state space are unknown. The OpenPI dataset (Tandon et al., 2020) is, to our knowledge, the first and only dataset for this task. However, we find a series of issues concerning data quality and evaluation, which may hinder progress in this line of research.
We identify three types of issues with the dataset: non-procedural documents, out-of-order steps, and ambiguous state changes. In particular, ∼32% of the state changes cannot be reliably inferred from the input, which we find encourages model hallucination. We filter out problematic data points and build a cleaner dataset via crowdsourcing.
For evaluation, the greedy matching strategy employed by Tandon et al. (2020) allows matching multiple predicted state changes to a single gold state change, inadvertently inflating the score when the model produces repetitive outputs. We propose a cluster-based metric that automatically merges repetitive state changes and enforces 1-to-1 assignment between clusters.
We propose two enhancements to the seq2seq generation model proposed for this task in Tandon et al. (2020). To capture the dependency between world states of consecutive time steps, we introduce an entity memory to preserve information about the world state for all previous steps. When predicting the state changes for subsequent actions, the model can access the state information of previous time steps. Additionally, while the closed-vocabulary setting usually provides a list of involved entities to track, such a list is inaccessible in the open-vocabulary setting. This requires the model to jointly identify involved entities and predict their state changes. To make the problem more tractable and help model learning, we propose an entity-conditioned prediction step where predictions are conditioned on each single entity extracted from the predictions of the first stage.

Related Work
Most existing work on entity state tracking (Weston et al., 2016; Dalvi et al., 2018; Bosselut et al., 2018) is closed-vocabulary, assuming that the number of possible states and involved entities is limited and known. Under this setting, state tracking can be modeled as a tagging problem (Gupta and Durrett, 2019; Amini et al., 2020; Huang et al., 2021).
The design of an external memory component has already been applied to closed-vocabulary state tracking (Bosselut et al., 2018; Yagcioglu et al., 2018; Gupta and Durrett, 2019). However, these approaches rely on known entities and only track a limited set of attributes. In this work, we use a dynamic memory that can handle emerging entities with open-vocabulary attributes.

Task and Dataset
The OpenPI dataset (Tandon et al., 2020) is, to our knowledge, the first and only dataset for open-vocabulary state tracking. The texts are collected from WikiHow and the state changes are manually annotated.
Dataset Issues We identify 3 types of quality issues in the OpenPI dataset. For input, we find that ∼15% of the input texts are not procedure texts because the steps do not show any temporal continuity (shown in Figure 2a). In valid procedure text inputs, ∼7.4% of the steps are invalid in the context of the procedure text (shown in Figure 2b). They either do not explicitly describe an executable action, or do not follow the temporal order when combined with other steps. For output, ∼32% of the state changes cannot be reliably inferred from the input (shown in Figure 2c). Such data encourage the trained model to hallucinate.
To address these issues and improve data quality, we build a cleaned dataset named OpenPI-C through three-stage human cleaning: (1) filtering out non-procedure input texts, (2) filtering out invalid steps, and (3) filtering out unreliable state changes. In the three stages, we assign 3/3/2 annotators to each data point, respectively, and achieve 69.4%/84.9%/71.0% agreement (defined as the ratio of data points where all annotators agree with each other). To verify the annotation quality, we manually annotate 50 instances for each stage. 90%/92%/84% of the crowd-sourced annotations match our manual annotations for the three stages, respectively. The statistics of the original OpenPI dataset and our OpenPI-C dataset are presented in Table 1. Detailed annotation settings and filtering criteria are in Appendix B. Though our dataset has fewer data samples, as shown in Figure 2, the removed data samples are mostly of low quality. As shown in Figure 4, including such samples in the dataset encourages hallucination and negatively impacts model performance.
Evaluation Issues In Tandon et al. (2020), each predicted state change is matched to the ground-truth state change with the highest similarity. As a result, near-duplicate predictions can all be matched to the same gold state change, which artificially boosts the model's score. We propose a cluster-based metric to address this issue. We cluster the predicted set and the gold-standard set separately based on Sentence-BERT (Reimers and Gurevych, 2019a) embedding similarity. After obtaining the predicted and gold-standard clusters, we assign a gold-standard cluster to each predicted cluster through maximal matching, which enforces a one-to-one mapping. Finally, we use the assignment to calculate precision, recall and F1 scores.
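As a rough illustration of the one-to-one assignment step, the following sketch scores already-formed clusters. It is not the released evaluation code: the token-overlap similarity `jaccard`, the "best pair" cluster score, the brute-force matching, and the way precision and recall are normalized are all assumptions made for this example, and every function name is hypothetical.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap (stand-in for Exact/BLEU/ROUGE)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster_score(pred_cluster, gold_cluster, sim=jaccard) -> float:
    """Assumed cluster-level score: the best pairwise similarity."""
    return max(sim(p, g) for p in pred_cluster for g in gold_cluster)


def one_to_one_matching(scores):
    """Brute-force maximum-weight one-to-one matching (small inputs only).

    scores[i][j] is the score of predicted cluster i against gold cluster j.
    Returns the (i, j) pairs of the best assignment.
    """
    n_pred = len(scores)
    n_gold = len(scores[0]) if scores else 0
    if n_pred == 0 or n_gold == 0:
        return []
    if n_pred <= n_gold:
        candidates = (list(zip(range(n_pred), cols))
                      for cols in permutations(range(n_gold), n_pred))
    else:
        candidates = (list(zip(rows, range(n_gold)))
                      for rows in permutations(range(n_pred), n_gold))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs


def cluster_prf(pred_clusters, gold_clusters, sim=jaccard):
    """Precision, recall and F1 over clusters under a one-to-one assignment."""
    if not pred_clusters or not gold_clusters:
        return 0.0, 0.0, 0.0
    scores = [[cluster_score(p, g, sim) for g in gold_clusters]
              for p in pred_clusters]
    matched = sum(scores[i][j] for i, j in one_to_one_matching(scores))
    precision = matched / len(pred_clusters)
    recall = matched / len(gold_clusters)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Two near-duplicate predictions form ONE cluster, so they are credited once.
pred = [["shape of potato was whole before and cut in half afterwards",
         "shape of potato was whole before and cut afterwards"]]
gold = [["shape of potato was whole before and cut in half afterwards"],
        ["location of knife was on counter before and in hand afterwards"]]
precision, recall, f1 = cluster_prf(pred, gold)
print(precision, recall, round(f1, 3))
```

Because the two near-duplicate predictions share a cluster, they can cover at most one gold cluster, so repetition no longer raises recall. For realistic cluster counts, the brute-force search would be replaced by a polynomial-time assignment solver.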

Method
Generation Baseline As shown in Figure 1, the input to the model is the concatenation of the goal, steps, and a prompt "Now, what happens?". In Tandon et al. (2020), each state change is represented as a templated sequence for generation. For example, (potato, shape, whole, cut in half) is converted to "shape of potato was whole before and cut in half afterwards".
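The templating described above can be sketched as follows. The template string follows the potato example in the text; the `StateChange` type, the function names, the regular expression used to parse a generated sentence back into a tuple, and the single-space separator in `build_input` are assumptions introduced for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class StateChange(NamedTuple):
    entity: str
    attribute: str
    before: str
    after: str


def to_template(sc: StateChange) -> str:
    """Render a state change as the templated target sentence."""
    return f"{sc.attribute} of {sc.entity} was {sc.before} before and {sc.after} afterwards"


# Non-greedy groups so that each slot stops at the first template keyword.
_PATTERN = re.compile(
    r"^(?P<attribute>.+?) of (?P<entity>.+?) was (?P<before>.+?)"
    r" before and (?P<after>.+?) afterwards$"
)


def from_template(text: str) -> Optional[StateChange]:
    """Parse a generated sentence back into a tuple; None if it is malformed."""
    m = _PATTERN.match(text.strip())
    if m is None:
        return None
    return StateChange(m["entity"], m["attribute"], m["before"], m["after"])


def build_input(goal: str, steps: List[str], prompt: str = "Now, what happens?") -> str:
    """Concatenate goal, steps and prompt (separator is an assumption)."""
    return " ".join([goal, *steps, prompt])


example = StateChange("potato", "shape", "whole", "cut in half")
sentence = to_template(example)
print(sentence)
print(from_template(sentence))
```

Parsing with a fixed pattern is fragile when an entity or attribute itself contains " of " or " was "; a real pipeline would need a policy for such cases.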

Entity Memory
To capture the temporal dependency across steps, we maintain a variable-size memory bank to store historical state changes. For each entity-attribute pair $(e, a)$ that appears in the prediction, we allocate a memory slot after it first appears in the predicted state changes. Suppose it first appears at step $k_0$; we then initialize its memory $m$ at the next step as $m^{k_0+1} = h^{k_0}$. Here, $h^{k_0}$ represents the hidden states for $(e, a)$ at step $k_0$.
In the subsequent steps, we update the memory every time the attribute $a$ of entity $e$ changes. Formally, at step $k$, $k > k_0$, if $(e, a)$ changes, the memory is updated from $h^k$. To compute $h^k$, we take the text expressing its state change from the generated sequence at step $k$ and compress its decoder-side hidden states $h_1, \dots, h_n$ into $h^k$ via attention, where $W_{k-k_0}$ is a learnable parameter for the $(k - k_0)$-th step after $(e, a)$ appears. To reduce the number of parameters, we share the same parameter across all steps after the first appearance. That is, we use $W_0$ to initialize the memory when $(e, a)$ first appears, and use another parameter $W_{>0}$ to update the memory.
We incorporate the memory through the decoder-side cross-attention. At step $k$, the keys and values for the cross-attention module include two parts: the encoder-side hidden states $h^{enc}_1, \dots, h^{enc}_n$ ($n$ refers to the number of tokens encoded by the encoder) and the memory vectors $m^k_1, \dots, m^k_M$ ($M$ refers to the number of created memory slots). We project them into key and value matrices $K$, $V$ with different parameters and feed them into the cross-attention module. In this way, the model can adaptively select between input information and historical state change information stored in the memory.
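To make the key/value construction concrete, here is a minimal pure-Python sketch of cross-attention over the concatenation of encoder states and memory slots. It assumes a single attention head, a single decoder query vector, and standard scaled dot-product attention; the projection matrices and all names are hypothetical, and the memory update rule itself is not modeled.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_with_memory(query, enc_states, memory,
                                W_k_enc, W_v_enc, W_k_mem, W_v_mem):
    """Attend jointly over n encoder states and M memory slots.

    Encoder states and memory slots use separate projection parameters,
    then share one softmax, so the query can trade off input text
    against stored state information.
    """
    keys = ([matvec(W_k_enc, h) for h in enc_states]
            + [matvec(W_k_mem, m) for m in memory])
    values = ([matvec(W_v_enc, h) for h in enc_states]
              + [matvec(W_v_mem, m) for m in memory])
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
enc_states = [[1.0, 0.0], [0.0, 1.0]]   # n = 2 encoder tokens
memory = [[1.0, 1.0]]                    # M = 1 memory slot
context, weights = cross_attention_with_memory(
    [1.0, 0.0], enc_states, memory, identity, identity, identity, identity)
print(len(weights), context)
```

With an empty memory list the function reduces to ordinary cross-attention over the encoder states, which matches the situation at the first step, before any slot has been created.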

Experiments
Our experiments are based on pre-trained BART (Lewis et al., 2020).1 We add another baseline that concatenates all previous state changes to the input (denoted as "BART + concat states"). Following Tandon et al. (2020), we also use GPT-2 (Radford et al., 2019) as a baseline.
1 Our proposed techniques can be applied to any encoder-decoder model. Among the base models that we have experi-
The main results are in Table 2. Overall, our two proposed techniques improve performance on most metrics, especially on the cluster-based metrics. "BART + concat states" takes the same information as our proposed entity memory (EMem) as input (historical steps and historical state changes), but performs worse than the baseline. This is because the concatenated historical state changes are too long and distract the model. As shown in Figure 3, entity-conditioned prediction (ECond) is able to produce more accurate outputs based on the same set of entities. We observe that the performance gains brought by entity-conditioned prediction are more significant on the cluster-based F1 metrics, because the baseline model produces longer and more repetitive outputs (the average number of output state changes per step is 7.71, compared to 6.76 for BART+ECond). As a result, the original F1 gives the baseline too much credit.
To analyze the effect of dataset cleaning, we compare the outputs of models trained on the original dataset and the cleaned dataset. As shown in Figure 4, the cleaned dataset encourages the model to stick to the input text and produce less hallucination. To quantify this effect, we manually examined 50 processes randomly sampled from the test set. Each process consists of multiple steps, and each step has multiple output state changes. We classified each output state change as containing hallucination or not. Overall, the model trained on OpenPI produced 749 hallucinated state changes while the model trained on OpenPI-C produced 393 (47.53% fewer).
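The relative reduction quoted above follows directly from the two counts:

```latex
\frac{749 - 393}{749} = \frac{356}{749} \approx 0.4753
```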

Conclusion and Future Work
In this paper, we study the open-vocabulary state tracking problem. We build upon the generation formulation introduced by Tandon et al. (2020) and propose two techniques: (1) entity memory, which models the temporal dependency by storing world states from previous steps, and (2) entity-conditioned prediction, which simplifies the task by predicting state changes conditioned on each single entity. We conduct human annotation to address data quality issues in the existing OpenPI dataset and thus propose a cleaned version of the OpenPI dataset. We propose an improved cluster-based metric to overcome the original metric's preference for repetition. For future work, we consider using external resources such as ConceptNet (Speer et al., 2017) to assist entity prediction.

Limitations
The scope of this work is limited by the available data. The OpenPI dataset (Tandon et al., 2020) is derived from WikiHow, focuses on everyday scenarios, and contains English text only. We would like to see resources that span more domains (e.g., scientific domains) and more languages.

Ethical Considerations
Our work does not involve the creation of new datasets. However, we would like to point out that the existing dataset OpenPI is based on WikiHow, which is primarily crowdsourced (with partial expert review). Thus some of the content is influenced by the cultural and educational background of the annotators. In our human cleaning, we recruit annotators from the United States and Canada only, which may also bring cultural bias to the content. In particular, some procedures are related to healthcare, and neither the procedure nor the model output should be regarded as medical advice.
BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871-7880. Association for Computational Linguistics.

A Clustering Algorithm
For output clustering, we use the stsb-distilroberta-base-v2 model provided by the sentence-transformers package to obtain sentence embeddings. We use cosine similarity between the embeddings as the similarity measure.
Algorithm 1: The clustering algorithm. The input set $y$ is the gold or the predicted set of state changes. Each output cluster $C_k$ is a subset of $y$ and all output clusters $C$ form a partition of $y$.
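Since only a fragment of Algorithm 1 survives in this text, the following is a hedged reconstruction of a threshold-based clustering of that general shape, not the authors' exact procedure. The assumptions are: items are visited in order, an item joins the first cluster containing any member with similarity above the threshold, and otherwise starts a new cluster. The bag-of-words cosine `bow_cosine` stands in for cosine similarity over Sentence-BERT embeddings, and the threshold value 0.7 is purely illustrative.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over token counts (stand-in for embedding cosine)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def threshold_cluster(items, sim=bow_cosine, th=0.7):
    """Greedy clustering: every item ends up in exactly one cluster."""
    clusters = []
    for item in items:
        for cluster in clusters:
            # Assumed rule: one sufficiently similar member is enough.
            if any(sim(item, member) > th for member in cluster):
                cluster.append(item)
                break
        else:
            clusters.append([item])  # no cluster matched: open a new one
    return clusters


state_changes = [
    "shape of potato was whole before and cut in half afterwards",
    "shape of potato was whole before and cut afterwards",
    "location of knife was on counter before and in hand afterwards",
]
clusters = threshold_cluster(state_changes)
print(clusters)
```

The output is a partition of the input by construction, which is the property the algorithm caption states; the resulting clusters depend on input order under this greedy rule.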

B Data Details
The OpenPI dataset is released by Tandon et al. (2020). It is an English-only dataset crawled from WikiHow and annotated via crowd-sourcing.
As mentioned before, we conduct human annotation to filter out low-quality data. The annotation study was reviewed by an ethics review board and determined not to be human subjects research. The annotation is conducted on the MTurk platform. To ensure that the annotators are native English speakers, we recruit annotators from the United States and Canada. We have informed the annotators in the annotation instructions that we are collecting data for research purposes. The annotation includes three stages:
Stage 1: filter out non-procedure texts. Each annotator is presented with an input text and asked to judge whether it is a procedure text or not. Each input text is annotated by three annotators; the reward for annotating each input text is $0.03. We remove input texts that are considered non-procedure texts by most annotators (i.e., at least two annotators). 15% of the input texts are removed at this stage.
Stage 2: filter out invalid steps. Each annotator is presented with a procedure text. For each step in the process, the annotator is asked to judge whether it is a valid step. Each input text is annotated by three annotators; the reward for annotating each input text is $0.2. We then remove steps that are considered invalid by most annotators (i.e., at least two annotators). 7.4% of the steps are removed at this stage.
Stage 3: filter out low-quality state changes. Each annotator is presented with an input procedure text and a state change caused by one of the steps. The annotator is asked to decide whether the state change is certain, uncertain or impossible. Each state change is annotated by two annotators; the reward for annotating each state change is $0.05. To ensure data quality, we remove state changes that receive at least one uncertain or impossible rating from the two annotators, which empirically yields the best results. 32% of the state changes are removed at this stage.
Screenshots of the annotation interface are shown in Figure 5. Finally, we manually examine the data and conduct rule-based filtering according to the following heuristics. We first remove steps with no state changes, and then remove procedure texts with < 3 steps.
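The aggregation rules of the three stages and the final rule-based filter can be sketched as below. The vote thresholds and the "< 3 steps" rule come from the description above; the data layout (dictionaries with `goal`, `steps`, `state_changes` keys) and all function names are hypothetical.

```python
def flagged_by_majority(votes, min_votes=2):
    """Stages 1 and 2: remove an item if at least two of three annotators flag it."""
    return sum(bool(v) for v in votes) >= min_votes


def keep_state_change(ratings):
    """Stage 3: keep a state change only if every annotator rated it 'certain'."""
    return all(r == "certain" for r in ratings)


def rule_based_filter(procedures, min_steps=3):
    """Drop steps without state changes, then procedures with fewer than 3 steps."""
    cleaned = []
    for proc in procedures:
        steps = [s for s in proc["steps"] if s["state_changes"]]
        if len(steps) >= min_steps:
            cleaned.append({**proc, "steps": steps})
    return cleaned


procedures = [
    {"goal": "g1", "steps": [
        {"text": "a", "state_changes": ["x"]},
        {"text": "b", "state_changes": []},
        {"text": "c", "state_changes": ["y"]},
        {"text": "d", "state_changes": ["z"]},
    ]},
    {"goal": "g2", "steps": [
        {"text": "a", "state_changes": ["x"]},
        {"text": "b", "state_changes": []},
        {"text": "c", "state_changes": ["y"]},
    ]},
]
cleaned = rule_based_filter(procedures)
print([p["goal"] for p in cleaned])
```

The order of the two rules matters: the second procedure has three steps before empty steps are dropped, but only two afterwards, so it is removed.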

C Experiment Details
We use GPT2-medium and BART-large models for the experiments. The numbers of parameters for the GPT-2 baseline, BART baseline, BART+EMem and BART+ECond models are 355M, 406M, 444M and 406M respectively. Each experiment is run on one Tesla P100 GPU and takes about 4 hours.
In training, we use the same training hyperparameters as Tandon et al. (2020), i.e., a learning rate of $5 \times 10^{-5}$, a batch size of 8, and 30 epochs.
In decoding, we use beam search with a beam size of 4. The decoding strategy is selected from top-p sampling (0.5 ≤ p ≤ 0.9), top-k sampling (5 ≤ k ≤ 10) and beam search (beam = 4). The best decoding strategy is found by manual tuning on the original OpenPI dataset. Results are in Table 4. Beam search significantly boosts performance over top-p or top-k sampling for all systems. We also show in Figure 6 that the length penalty can be used to control the number of outputs, and thus to balance precision and recall.
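For readers who want to reproduce a comparable decoding search, the candidate settings can be written as keyword arguments in the style of the `generate` method of the transformers library (`num_beams`, `do_sample`, `top_p`, `top_k`, `length_penalty`). This is a configuration sketch only: the grid steps (0.1 for p, integers for k) are assumptions, and the released code may set these options differently.

```python
# Beam search with beam size 4, the setting reported as best.
BEAM_SEARCH = {"num_beams": 4, "do_sample": False}


def nucleus(p):
    """Top-p sampling; top_k=0 is intended to switch off top-k filtering."""
    return {"do_sample": True, "top_p": p, "top_k": 0}


def top_k(k):
    """Top-k sampling with the k most likely tokens."""
    return {"do_sample": True, "top_k": k}


def with_length_penalty(kwargs, penalty):
    """Return a copy of a beam-search setting with a length penalty added."""
    return {**kwargs, "length_penalty": penalty}


# Assumed grid over the ranges quoted in the text.
CANDIDATES = (
    [BEAM_SEARCH]
    + [nucleus(p) for p in (0.5, 0.6, 0.7, 0.8, 0.9)]
    + [top_k(k) for k in range(5, 11)]
)

# Length penalties listed in the Figure 6 caption.
LENGTH_PENALTY_SWEEP = [with_length_penalty(BEAM_SEARCH, lp) for lp in (0.2, 1.0, 2.0)]

print(len(CANDIDATES), LENGTH_PENALTY_SWEEP[0])
```

Each dictionary would be unpacked into a generation call together with the encoded input; the model, tokenizer and maximum output length are left out because they are not specified here.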
Compared to Tandon et al. (2020), our reimplemented GPT-2 baseline differs in that: (1) we include the process goal $g$ in the input, and (2) we use beam search with a beam size of 4 instead of top-p sampling.
We also run the experiments on the original OpenPI dataset and compare with the results of Tandon et al. (2020). Results are shown in Table 3.

D Scientific Artifacts
Scientific artifacts we use in this work include: (1) The OpenPI dataset (Tandon et al., 2020) and its baseline and evaluation code, released under the MIT License. The dataset is collected from WikiHow, focuses on everyday scenarios, and contains English text only. Our use is consistent with the resource's intended use, which is to facilitate research on open-vocabulary state tracking tasks. (2) Three pre-trained models: GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020) provided by transformers, and Sentence-BERT (Reimers and Gurevych, 2019b) provided by sentence-transformers, all licensed under the Apache License 2.0. We use the models for research, which is consistent with their intended use. Our code and data are released under the MIT License, which is compatible with the artifacts utilized in our research.

Figure 1: The baseline generation model for open-vocabulary state tracking takes the goal and previous steps as input and generates the state changes as templated sentences. We propose to model temporal dependency between steps using an entity memory module and increase entity awareness of the model by using a two-stage procedure where the second-stage prediction is conditioned on entities from the first stage.
(a) The text is not a procedure text because the steps are not temporally related.
(b) The first and second steps are invalid steps. The first step describes a pre-condition and is not executable. The second step provides complementary information and is not necessary to execute when combined with other steps.
(c) The state change cannot be reliably inferred from step 2. Step 2 involves aiming and stabilizing the hand only, not the lungs.

Figure 2: Examples of low-quality data points removed during the filtering process.

Figure 3: A good case of entity-conditioned prediction (ECond). Based on the same set of entities, entity-conditioned prediction is able to correct the prediction for entities spray bottle and oil and choose more appropriate wording for water.

Figure 4: Outputs of our model (BART+EMem+ECond) trained on OpenPI and OpenPI-C respectively. The model trained on OpenPI produces more hallucination (highlighted in red).

Figure 5: Screenshots of the annotation interface.

Table 4: Results (in %) of GPT-2 baseline, BART baseline, and our proposed method with different decoding strategies on the original OpenPI dataset. We report clustering-based F1 with BLEU. Among settings, beam search achieves the best performance.

Figure 6: Precision and recall under different numbers of outputs for BART baseline.The length penalty is set as 0.2, 1.0 and 2.0 respectively.
which is not applicable for the open-vocabulary case. Tandon et al. (2020) proposed the OpenPI dataset for the more practical open-vocabulary setting. They formulate the task as a generation problem to handle the open-vocabulary challenge.

Table 1: Statistics of the original OpenPI dataset and our OpenPI-C dataset.
A challenge for this open-vocabulary task is the lack of access to the entities involved. Compared to directly modeling all state changes $p(Y | x, g)$ given the steps $x$ and goal $g$, we can decompose this problem into first predicting entities, and then modeling the state change of each entity separately as $p(Y_e | x, g, e)$. Conditioning on the entity simplifies the task and eases model training. We reuse the baseline model and replace the natural language prompt with "Now, what happens to e?". During inference, we extract all the entities in the prediction and perform entity-conditioned prediction for each entity $e$. Finally, we merge the $N$ sets of state changes as the final output.

                  Exact   BLEU   ROUGE   Exact   BLEU   ROUGE
GPT-2              3.92  20.81   39.73    5.72  20.31   33.40
BART               4.88  23.35   41.88    7.10  22.72   35.44
+concat states     4.73  21.96   40.38    6.69  20.61   32.88
BART+EMem          5.27  24.06   42.71    7.65  23.40   35.79
+ECond             5.70  23.81   42.14    8.27  23.56   35.80
+EMem+ECond        5.65  23.73   42.15    8.26  22.96   35.34

Table 2: Main results on OpenPI-C (in %). EMem denotes Entity Memory and ECond denotes Entity-Conditioned prediction.
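The two-stage entity-conditioned prediction (ECond) procedure can be sketched as follows. The two prompts are the ones quoted in the paper; the `generate` callable, the regular expression used to pull entities out of templated sentences, the space separator, and the de-duplication during merging are assumptions for illustration, and `fake_generate` is a stub that only makes the example runnable.

```python
import re
from typing import Callable, List

# Matches "<attribute> of <entity> was <before> before and <after> afterwards".
_PATTERN = re.compile(r"^.+? of (?P<entity>.+?) was .+? before and .+? afterwards$")


def extract_entities(state_changes: List[str]) -> List[str]:
    """Collect distinct entities, in order of first appearance."""
    entities = []
    for sc in state_changes:
        m = _PATTERN.match(sc.strip())
        if m and m.group("entity") not in entities:
            entities.append(m.group("entity"))
    return entities


def entity_conditioned_predict(goal: str, steps: List[str],
                               generate: Callable[[str], List[str]]) -> List[str]:
    """Stage 1 proposes entities; stage 2 predicts once per entity; then merge."""
    context = " ".join([goal, *steps])
    first_stage = generate(f"{context} Now, what happens?")
    merged = []
    for entity in extract_entities(first_stage):
        for sc in generate(f"{context} Now, what happens to {entity}?"):
            if sc not in merged:  # assumed: exact duplicates are merged
                merged.append(sc)
    return merged


def fake_generate(prompt: str) -> List[str]:
    """Stub standing in for the seq2seq model (hypothetical outputs)."""
    if prompt.endswith("Now, what happens?"):
        return ["shape of potato was whole before and cut in half afterwards",
                "location of knife was on counter before and in hand afterwards"]
    if prompt.endswith("to potato?"):
        return ["shape of potato was whole before and cut in half afterwards"]
    if prompt.endswith("to knife?"):
        return ["location of knife was on counter before and in hand afterwards",
                "cleanness of knife was clean before and dirty afterwards"]
    return []


result = entity_conditioned_predict("Cut a potato", ["Slice the potato in half."], fake_generate)
print(result)
```

The second stage costs one extra generation call per extracted entity, which is the price of conditioning each prediction on a single entity.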
$S(y_i, y) > th$ then /* Assign $y_i$ to cluster $C_k$ */

Table 3: Results (in %) on the original OpenPI dataset. EMem denotes Entity Memory and ECond denotes Entity-Conditioned prediction.