Minimum Bayes’ Risk Decoding for System Combination of Grammatical Error Correction Systems

For sequence-to-sequence tasks it is challenging to combine individual system outputs. Further, there is often a mismatch between the decoding criterion and the one used for assessment. Minimum Bayes' Risk (MBR) decoding can be used to combine system outputs in a manner that encourages better alignment with the final assessment criterion. This paper examines MBR decoding for Grammatical Error Correction (GEC) systems, where performance is usually evaluated in terms of edits and an associated F-score. Hence, we propose a novel MBR loss function directly linked to this form of criterion. Furthermore, an approach to expand the possible set of candidate sentences is described. This builds on a current max-voting combination scheme, as well as individual edit-level selection. Experiments on three popular GEC datasets and with state-of-the-art GEC systems demonstrate the efficacy of the proposed MBR approach. Additionally, the paper highlights how varying reward metrics within the MBR decoding framework can provide control over precision, recall, and the F-score in combined GEC systems.


Introduction
Ensembling, the combination of system outputs, is a powerful technique in deep learning, exploiting diverse model capabilities for robust predictions. Though numerous methodologies exist for system combination (Ganaie et al., 2021), when there is only access to model outputs, many methods are inapplicable and thus the simplest method becomes the averaging of model outputs. However, for sequence-to-sequence (seq2seq) systems, such as summarization, machine translation, and grammatical error correction (GEC), output averaging is less straightforward. A further challenge with seq2seq tasks is the mismatch between the decoding and assessment criteria. Kumar and Byrne (2004) proposed the utilization of Minimum Bayes' Risk (MBR) decoding as a means to select an output that minimizes the theoretical risk according to a designated reward metric. We propose a novel variant of MBR decoding for GEC to allow for system combination and give better alignment with the assessment criteria.
The nature of a GEC task permits the use of MBR decoding within the "edit"-space. Each output sequence can be represented as a set of "edits" required to transform the input sequence into the output. Consequently, the selection of a single output sequence for GEC can be achieved through MBR decoding with a reward function defined on the set of edits, aligned with the edit-based F-score typically used in GEC assessment criteria. Beyond selection, an additional technique known as max-voting (Tarnavskyi et al., 2022) can be employed to combine different sets of edits. We propose an enhancement to the performance achieved through max-voting by treating the output sequences obtained from the combination as additional candidates for MBR decoding. Further, with a greedy MBR decoding algorithm, we explore the edit space to identify other candidate edit sets. Through experiments on three popular GEC datasets and use of state-of-the-art GEC systems (Grammarly's GECToR (Omelianchuk et al., 2020)), we demonstrate that our MBR decoding approach in the edit space consistently leads to significant performance gains. Further, we also show that by selecting different reward metrics as part of the MBR decoding approach we can provide explicit control over precision, recall and the overall F-score used to assess GEC systems.

Related Work
Grammatical Error Correction: Early GEC systems using hand-crafted rules (Naber, 2003) were replaced by encoder-decoder architectures, using for example Recurrent Neural Networks (Cho et al., 2014). Today, many state-of-the-art GEC systems use Transformer-based (Vaswani et al., 2017) encoder-decoder architectures to perform the sequence-to-sequence GEC task (Kaneko et al., 2020; Chen et al., 2020; Kiyono et al., 2019; Lichtarge et al., 2020; Stahlberg and Kumar, 2020). However, LaserTagger (Malmi et al., 2019), the PIE model (Awasthi et al., 2019) and Grammarly's GECToR (Omelianchuk et al., 2020) are all able to achieve competitive performance using a sequence-to-edit structure for the overall sequence-to-sequence task, where a token can be tagged with edit operations. Once a set of tags has been defined, the edit operations can be applied to the input sequence to generate the grammatically correct output sequence. The GECToR system is particularly efficient at inference as it uses a Transformer encoder followed by softmax over linear layers for edit tag prediction, which is significantly faster than standard sequence-to-sequence GEC system decoders. Further, Wu et al. (2023) demonstrated that GECToR performs better than the most recent generative large language models, e.g. ChatGPT (Brown et al., 2020), which tend to over-correct, compromising on recall performance. Hence this work uses the GECToR model as its base GEC architecture.
System Combination for seq2seq systems: Individual deep learning systems for classification tasks can be combined in many ways: stacking (Wolpert, 1992), negative correlation learning (Liu and Yao, 1999), max-voter schemes (Ju et al., 2018; Simonyan and Zisserman, 2014) or probability averaging (He et al., 2016; Raina et al., 2020; Szegedy et al., 2015). However, for generative language tasks such as GEC, where the output is a sequence of tokens, many traditional ensembling approaches are inapplicable. Sequence-level ensembling approaches, however, can address this by averaging conditional token-level probabilities of multiple systems (Sennrich et al., 2015; Freitag et al., 2017; Malinin and Gales, 2021; Fathullah et al., 2021). However, this approach requires identical member architectures as well as access to the output probabilities of the predicted tokens. With the rising trend of limited black-box access to large language models (e.g. ChatGPT (Liu et al., 2023)), system combination methods that only require the generated output sequences have practical benefit.
With access to only the output sequences from individual seq2seq systems, it is challenging to combine them into a single output. For automatic speech recognition, Sim et al. (2007) select a single output using a simple Minimum Bayes' Risk (MBR) decoding approach (Kumar and Byrne, 2004), where the aim is effectively to select the most average/representative output sequence. Similarly, Manakul et al. (2023) use MBR to combine sequences for clinical document summarization. The MBR approach has also recently been applied to machine translation (Rosti et al., 2007a,b; Freitag et al., 2022; Müller and Sennrich, 2021; Zhang et al., 2022). For GEC systems, Tarnavskyi et al. (2022) propose a max-voting scheme, where only edits predicted by the majority of individual systems are retained. We further improve GEC performance by applying MBR decoding to a sequence selection set augmented with sequences from max-voting. We further enrich this selection space with a greedy search over edits.

Output Sequence Combination for GEC
A Grammatical Error Correction (GEC) system predicts a grammatically correct output sequence, y, from an input sequence, x. With multiple different GEC system output sequence predictions, Y = {y_1, ..., y_N}, for the same input sequence, x, it is challenging to combine them into a single, best sequence. It is useful to consider the edit-space, where a set of edits, e_n(x, y_n) = {e_1, ..., e_{|e_n|}}, can be used to represent each predicted output sequence, y_n. A single edit in the edit set can be defined fully by an input token in x and an edit operation to apply (insertion, deletion or substitution). This section describes how Minimum Bayes' Risk decoding can be used in the edit-space to combine the different output sequences in Y.
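To make the edit-space concrete, the following is a minimal sketch of one possible edit representation. The `(start, end, replacement)` tuple format and the `apply_edits` helper are illustrative assumptions (real systems typically extract edits with a toolkit such as ERRANT), but they capture the insertion/deletion/substitution operations described above.

```python
# A minimal, hypothetical representation of the GEC edit-space.
# An edit is a (start_index, end_index, replacement_tokens) tuple over
# the input token sequence x:
#   - deletion:      replacement is empty, start < end
#   - insertion:     start == end, replacement non-empty
#   - substitution:  start < end, replacement non-empty

def apply_edits(tokens, edits):
    """Apply a set of non-overlapping edits to the input tokens."""
    out, i = [], 0
    for start, end, repl in sorted(edits):
        out.extend(tokens[i:start])   # copy unchanged tokens
        if repl:
            out.extend(repl.split())  # insertion / substitution tokens
        i = end                       # skip deleted / replaced span
    out.extend(tokens[i:])
    return out

x = "He go to school yesterday".split()
e = {(1, 2, "went")}                  # substitution: "go" -> "went"
print(apply_edits(x, e))              # ['He', 'went', 'to', 'school', 'yesterday']
```

Representing each hypothesis y_n by its edit set e_n(x, y_n) means set operations (union, intersection, overlap counts) can be used directly in the decoding schemes that follow.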

MBR decoding for GEC
MBR decoding aims to select the most representative output sequence, y^* ∈ Y. For GEC, we aim to maximise a reward score R in the edit-space that encourages better alignment with the final assessment metric,

y^* = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{p(\tilde{y}|x)} \left[ R\left( e(x, \tilde{y}),\, e(x, y) \right) \right] \quad (1)

where the reward score, R(ẽ, e), views ẽ as the reference edits and e as the hypothesis/predicted edits.
In practice, it is difficult to meaningfully estimate the posterior distribution, p(ỹ|x), for each output sequence. Hence, we consider only similarly performing systems' output sequences, Y^(c) ⊆ Y, to calculate the expectation of the reward, and so we approximate each of these sequences as equiprobable,

y^* = \arg\max_{y \in \mathcal{Y}^{(s)}} \frac{1}{|\mathcal{Y}^{(c)}|} \sum_{\tilde{y} \in \mathcal{Y}^{(c)}} R\left( e(x, \tilde{y}),\, e(x, y) \right) \quad (2)

where Y^(s) represents the set of possible output sequences we want to select from.
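The equiprobable approximation of Equation 2 can be sketched as follows. Edit sets are shown as Python sets of hypothetical `(start, end, replacement)` tuples, and the reward function is left pluggable (a simple Jaccard overlap is used here purely for illustration; the edit-based rewards used in this work are defined in Section 3.4).

```python
def mbr_select(candidates, expectation_set, reward):
    """Equation (2): pick the candidate edit set in Y^(s) maximising the
    expected reward, treating every edit set in Y^(c) as equiprobable."""
    def expected_reward(e):
        return sum(reward(e_ref, e) for e_ref in expectation_set) / len(expectation_set)
    return max(candidates, key=expected_reward)

def jaccard(ref, hyp):
    """Simple overlap-based reward between two edit sets (illustrative)."""
    return len(ref & hyp) / max(len(ref | hyp), 1)

# Hypothetical edit sets from three systems for the same input sentence
e_b = {(1, 2, "went"), (4, 4, ".")}
e_r = {(1, 2, "went")}
e_x = {(1, 2, "goes"), (4, 4, ".")}

best = mbr_select([e_b, e_r, e_x], [e_b, e_r, e_x], jaccard)
print(sorted(best))   # [(1, 2, 'went'), (4, 4, '.')]
```

Here e_b is selected as the most representative hypothesis: it shares one edit with each of the other two systems, whereas e_r and e_x each disagree with one another entirely.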

MBR decoding with edit voting
Inspired by Tarnavskyi et al. (2022), the different edit sets, {e_1, ..., e_N}, associated with the different output sequences can be combined to create a single edit set, e^(m), containing all the individual edits present in at least m of the edit sets (i.e. m votes). This new combined edit set represents a new combined output sequence, y^(m). The MBR decoding approach of Equation 1 can now be applied by simply including the combined sequence in the set of sequences to select from, such that y^(m) ∈ Y^(s). Note that the voting scheme can generate a maximum of N different combined sequences, with e^(1) being the union of all edit sets and e^(N) the intersection. Hence the selection space of sequences Y^(s) can be made richer with an extra N sequences.
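The voting scheme above amounts to a threshold on edit counts across systems; a minimal sketch (edits shown as opaque labels purely for brevity):

```python
from collections import Counter

def vote_combine(edit_sets, m):
    """Keep every edit predicted by at least m of the N individual edit
    sets; e^(1) is their union and e^(N) their intersection."""
    counts = Counter(edit for edits in edit_sets for edit in edits)
    return {edit for edit, c in counts.items() if c >= m}

# Three hypothetical systems; "A".."D" stand in for full edit tuples
edit_sets = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}]
print(sorted(vote_combine(edit_sets, 1)))   # ['A', 'B', 'C', 'D']  (union)
print(sorted(vote_combine(edit_sets, 2)))   # ['A', 'B']            (majority)
print(sorted(vote_combine(edit_sets, 3)))   # ['A']                 (intersection)
```

Each threshold m = 1, ..., N yields one combined edit set e^(m), and hence one extra candidate sequence y^(m) for the MBR selection set Y^(s).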

Greedy MBR decoding for edit selection
Instead of augmenting the selection set Y^(s) with only a few sequences, it is useful to consider all possible edit sets. However, it is computationally infeasible to consider every possible edit set. Hence, this work proposes a practical, greedy method to increase the richness of the selection set. The minimal edit set is arguably the intersection of all edit sets, e^(N). In contrast, the set of possible edits is given by the union set, e^(1). Hence, we can insert individual edits one by one from the union set into the intersection set. Every new edit insertion into the existing edit set represents a new output sequence y (that can be added to Y^(s)). However, we only retain the edit insertions that give a new output sequence that increases the MBR expected reward, \frac{1}{|\mathcal{Y}^{(c)}|} \sum_{\tilde{y} \in \mathcal{Y}^{(c)}} R(e(x, \tilde{y}), e(x, y)), from Equation 2. This way we can efficiently search a richer selection set, Y^(s), of output sequences to find the best combined output sequence y^*.
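The greedy procedure can be sketched as follows (an illustrative implementation under the assumptions above; note the result can depend on the order in which candidate edits are tried):

```python
def greedy_mbr(union_edits, intersection_edits, expectation_set, reward):
    """Greedily grow the intersection edit set e^(N) towards the union
    e^(1), retaining only insertions that raise the expected reward of
    Equation (2)."""
    def expected_reward(e):
        return sum(reward(ref, e) for ref in expectation_set) / len(expectation_set)

    current = set(intersection_edits)
    best = expected_reward(current)
    for edit in sorted(union_edits - current):
        candidate = current | {edit}
        score = expected_reward(candidate)
        if score > best:              # keep only improving insertions
            current, best = candidate, score
    return current

def jaccard(ref, hyp):
    """Illustrative overlap-based reward between two edit sets."""
    return len(ref & hyp) / max(len(ref | hyp), 1)

# Two hypothetical systems agree on "A"; "B" and "C" are disputed
result = greedy_mbr({"A", "B", "C"}, {"A"}, [{"A", "B"}, {"A", "C"}], jaccard)
print(sorted(result))
```

Each accepted insertion corresponds to adding one new candidate sequence to Y^(s), so the search explores a far richer space than the N voting sequences alone while only ever evaluating |e^(1)| - |e^(N)| candidates.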

MBR reward score
Equation 1 uses a reward score R(ẽ, e) to perform MBR decoding. Careful selection of the reward score allows for control over the desired metric to optimise. We can, for example, aim to combine systems in a manner that encourages better edit recall,

R_{recall}(\tilde{e}, e) = \frac{|\tilde{e} \cap e|}{|\tilde{e}|} \quad (3)

Conversely, it may be desirable to have a system with high precision,

R_{prec}(\tilde{e}, e) = \frac{|\tilde{e} \cap e|}{|e|} \quad (4)

However, it is usually desirable to have a GEC system with a good combination of precision and recall, as measured by an F-k score,

R_{F_k}(\tilde{e}, e) = \frac{(1 + k^2) \cdot R_{prec} \cdot R_{recall}}{k^2 \cdot R_{prec} + R_{recall}} \quad (5)

As precision is more important than recall for GEC systems, this work aligns the reward metric with the F0.5 score. The Jaccard similarity reward metric is also explored as an alternative in Appendix A.
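The three edit-set rewards above are straightforward to compute; a sketch (edge-case handling for empty edit sets is an assumption, as the paper does not specify it):

```python
def recall_reward(ref, hyp):
    """Equation (3): fraction of reference edits recovered."""
    return len(ref & hyp) / len(ref) if ref else 1.0

def precision_reward(ref, hyp):
    """Equation (4): fraction of hypothesis edits in the reference."""
    return len(ref & hyp) / len(hyp) if hyp else 1.0

def f_k_reward(ref, hyp, k=0.5):
    """Equation (5): F_k score over edit sets; k = 0.5 weights precision
    over recall, matching the F0.5 metric standard in GEC evaluation."""
    p, r = precision_reward(ref, hyp), recall_reward(ref, hyp)
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + k**2) * p * r / (k**2 * p + r)

# Hypothetical edit labels standing in for full edit tuples
ref, hyp = {"A", "B", "C"}, {"A", "B", "D"}
print(precision_reward(ref, hyp))   # 0.6666...
print(recall_reward(ref, hyp))      # 0.6666...
```

Swapping one of these functions in as R in Equation 2 is all that is needed to bias the combined system towards recall, precision, or the F0.5 balance between them.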

Experimental setup
We evaluate the performance of the combined systems on three popular grammatical error correction corpora: the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011), the CoNLL-2014 test set, and the BEA-2019 corpus.

Results
MBR decoding (Equation 2) is applied in the edit-space for the three individual GECToR systems' outputs (b, r, x). Here, as the systems have similar performance (the equiprobable posterior assumption is valid), we let the selection set and the set of sequences used to calculate the expected reward be the same, Y^(s) = Y^(c) = {b, r, x}. Table 2 compares the different reward functions, R, when applying MBR decoding. Selection with precision (Equation 4) and F0.5 (Equation 5) oriented reward metrics gives a significant increase in performance over the individual systems in Table 1. Although the recall reward (Equation 3) does not increase F0.5 performance, it does significantly increase recall performance. This demonstrates that a simple application of MBR decoding can be used to combine individual systems to improve performance, and that selection of the reward function gives specific control over the precision and recall of the combined system.
Section 3.2 describes how MBR decoding can be applied to systems combined by a voting scheme in the edit space. Table 3 shows the performance of systems combined with voting, where an individual edit requires m votes (from the b, r or x edit system predictions) to be included in the combined edit set, e^(m), to form the single combined sequence y^(m). Note here that e^(1) is the union set and e^(3) is the intersection, and so these sequences encourage either a higher recall or precision respectively. (GEC performance for CoNLL and FCE is measured using the ERRANT tool (Bryant et al., 2017); note that CoNLL is often evaluated with a different scorer in other papers. BEA is evaluated using the online submission portal: https://codalab.lisn.upsaclay.fr/competitions/4057.) Table 4 shows the impact of MBR decoding where all the separate voting sequences (y^(1), y^(2), y^(3)) are included in the selection set, Y^(s) = {b, r, x, y^(1), y^(2), y^(3)}. Note that we maintain the same set of sequences for the expected reward calculation, Y^(c) = {b, r, x}, to ensure the equiprobable posterior assumption holds. It is evident that a richer selection set allows for even greater improvements in model performance for precision and F0.5 reward MBR decoding.

Table 4: MBR with Y^(c) = {b, r, x} and Y^(s) = {b, r, x, y^(1), y^(2), y^(3)}.
Finally, as described in Section 3.3, MBR decoding can be performed over a richer edit selection space by greedily adding individual edits to the intersection edit set, e^(3), from the union edit set, e^(1). Experiments revealed (Appendix B) that allowing all edits to be included from the union set can significantly increase the risk of poor insertions, compromising performance. Hence, instead we only consider edits from e^(2) to be added to the intersection set e^(3). Table 5 demonstrates that MBR decoding over this richer set of sequences can give better performance (CoNLL) than MBR with voting, but does not always give the best performance (BEA and FCE have better performance in Table 4). This is perhaps because the expected reward over the individual systems (b, r, x) is not necessarily perfectly aligned with the final F0.5 score relative to the true reference edits used in evaluation, and thus over-optimisation of the selection set for MBR decoding does not help performance for some datasets.

Conclusions
The combination of sequence-to-sequence grammatical error correction (GEC) systems is challenging. There is also often a mismatch between the decoding criterion and assessment criterion used for GEC systems. This work demonstrates that a novel Minimum Bayes' Risk (MBR) decoding approach within the edit-space can give an effective system combination method that aligns better with the assessment criteria. We further showed that enhancing the selection space to encompass sequences formulated by max-voting over individual edits can further improve system performance. Moreover, the employment of a greedy search strategy, guided by an MBR reward function, can result in performance gains for the combined system. Crucially, the choice of a reward function in the MBR framework gives users the ability to optimize desired characteristics of the combined GEC system, such as precision, recall or the F-score.

Limitations
This work explored how MBR decoding can be used to combine individual GEC systems, as well as align the combined system's performance to the edit-based F-score used to assess GEC systems. Experiments were performed with Grammarly's GECToR-based systems. It would be useful to extend these experiments to other state-of-the-art GEC systems. Although these other systems are not as efficient as GECToR due to the use of an auto-regressive Transformer decoder (as opposed to GECToR's encoder-only structure), it is still meaningful to understand how these systems react to MBR decoding used for system combination. This is particularly relevant as generative large language models are increasingly used for standard natural language tasks.

Ethics Statement
This work reports on an efficient method to combine individual GEC system outputs in a manner that better aligns with assessment and improves performance. There are no perceived ethical risks associated with this work.
Table 2: MBR with Y^(c) = Y^(s) = {b, r, x}, comparing the different reward functions R.

Table 5: MBR with Y^(c) = {b, r, x} and greedy search for Y^(s).