Another Dead End for Morphological Tags? Perturbed Inputs and Parsing

The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high-quality written content, and we know little about their influence on models in production that face lexical errors. We expand these setups and design an adversarial attack to verify whether the use of morphological information by parsers: (i) contributes to error propagation, or (ii) can instead play a role in correcting mistakes that word-only neural parsers make. The results on 14 diverse UD treebanks show that under such attacks, morphological tags contribute to degrade the performance of transition- and graph-based models even faster, while they are helpful for the (lower-performing) sequence labeling parsers. We also show that if morphological tags were utopically robust against lexical perturbations, they would be able to correct parsing mistakes.


Introduction
The use of morphological tags was long a core component of dependency parsers to improve performance (Ballesteros and Nivre, 2012). With the rise of neural models, feeding explicit morphological information has largely vanished as a practice, with the (frequent) exception of part-of-speech (PoS) tags. In this line, Ballesteros et al. (2015) already found that character-based word vectors helped improve performance over purely word-level models, especially for morphologically rich languages, for which the use of morphological information is more relevant (Dehouck and Denis, 2018). Relatedly, Dozat et al. (2017) showed that predicted PoS tags still improved the performance of their graph-based parser, even when used together with character-based representations. Smith et al. (2018) and de Lhoneux et al. (2017) studied the impact that ignoring PoS tag vectors had on the performance of a biLSTM transition-based parser (Kiperwasser and Goldberg, 2016). They conclude that, when considering PoS tag, word-level, and character-level embeddings, any two of those vectors are enough to maximize a parser's performance, i.e., PoS tag vectors can be excluded when using both word-level and character-level vectors. Zhou et al. (2020) showed the utility of PoS tags when learned jointly with parsing. Recently, Anderson and Gómez-Rodríguez (2021) and Anderson et al. (2021) have explored the differences between using gold and predicted PoS tags, showing that the former are helpful to improve the results, while the latter often are not, with the exception of low-resource languages, where they obtain small but consistent improvements. Furthermore, Muñoz-Ortiz et al. (2022) showed that the efficacy of PoS tags in the context of sequence labeling parsing is greatly influenced by the chosen linearization method.
However, most of such work has focused on: (i) studying the effect of the universal PoS tags (Zeman et al., 2021), and (ii) their impact on non-perturbed inputs. Yet, NLP models are very sensitive and brittle to small attacks, and simple perturbations like misspellings can greatly reduce performance (Ebrahimi et al., 2018; Alzantot et al., 2018). This has been shown for tasks such as named-entity recognition, question answering, semantic similarity, and sentiment analysis (Moradi and Samwald, 2021). In parallel, defensive strategies have been tested to improve the robustness of NLP systems, e.g., placing a word recognition module before downstream classifiers (Pruthi et al., 2019), or using spelling checks and adversarial training (Li et al., 2019). Yet, as far as we know, no related work has tested perturbed inputs for parsing and the effect, positive or negative, that using morphological information as explicit signals during inference might have in guiding the parsers.

Adversarial framework
Perturbed inputs occur for several reasons, for instance deliberate adversarial attacks (Liang et al., 2018) or, more likely, unintended mistakes made by human writers. In any case, they have an undesirable effect on NLP tools, including parsers. Our goal is to test whether, under such adversarial setups, coarse- and fine-grained morphological tags: (i) could help obtain more robust and better results in comparison to word-only parsers (going against the current trend of removing any explicit linguistic input from parsers); or (ii) on the contrary contribute to degrade parsing performance.

Perturbed inputs
To perturb our inputs, we use a combination of four adversarial misspellings, inspired by Pruthi et al. (2019), who designed their method relying on previous psycholinguistic studies (Davis, 2003; Rawlinson, 1976). In particular, we consider to: (i) drop one character, (ii) swap two contiguous characters, (iii) add one character, and (iv) replace a character with an adjacent character on a QWERTY keyboard. These changes will probably transform most words into out-of-vocabulary terms, although some perturbations could generate valid tokens (likely occurring in an invalid context). We only apply perturbations to a fraction of the content words of a sentence (details in §3), as function words tend to be shorter and a perturbation could make them unrecognizable, which is not our aim.
Finally, we only allow a word to suffer a single attack. Since we evaluate on a multilingual setup, we use language-specific keyboard layouts to generate the perturbations. We restrict our analysis to languages that use the Latin alphabet, but our adversarial attack would be, in principle, applicable to any alphabetic script.
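The four attacks above can be sketched as a single word-level function. This is an illustrative reconstruction, not the authors' code: the `QWERTY_NEIGHBORS` map is a hypothetical, partial stand-in for the language-specific keyboard layouts used in the experiments.

```python
import random

# Hypothetical, partial adjacency map for a QWERTY layout; the paper
# uses language-specific keyboards instead.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "r": "edft", "t": "rfgy", "n": "bhjm",
}

def perturb_word(word, rng=random):
    """Apply one of the four misspelling attacks to a single word."""
    if len(word) < 2:
        return word
    op = rng.choice(["drop", "swap", "add", "replace"])
    i = rng.randrange(len(word) - 1)
    if op == "drop":    # (i) drop one character
        return word[:i] + word[i + 1:]
    if op == "swap":    # (ii) swap two contiguous characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "add":     # (iii) insert one character
        c = rng.choice("abcdefghijklmnopqrstuvwxyz")
        return word[:i] + c + word[i:]
    # (iv) replace with an adjacent key, when the neighborhood is known
    neighbors = QWERTY_NEIGHBORS.get(word[i].lower())
    return word[:i] + rng.choice(neighbors) + word[i + 1:] if neighbors else word
```

Note that, as in the paper, each call applies exactly one attack to one word; applying it to very short (function) words is a no-op here.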

Parsing models
Since we want a thorough picture of the impact of using morphological information on parsers, we include three models from different paradigms:

1. A transition-based parser (Fernández-González and Gómez-Rodríguez, 2019). It uses biLSTMs (Hochreiter and Schmidhuber, 1997) to contextualize the words, and the outputs are then fed to a pointer network (Vinyals et al., 2015), which keeps a stack and, in a left-to-right fashion, decides the head of each token.
2. A biaffine graph-based parser (Dozat et al., 2017). This model also uses biLSTMs to first contextualize the input sentence. Differently from Fernández-González and Gómez-Rodríguez, the tree is predicted through a biaffine attention module, and to ensure well-formed trees it uses either the Eisner (1996) or the Chu (1965); Edmonds (1968) algorithms.

3. A sequence labeling parser (Strzyz et al., 2020) that uses a 2-planar bracketing encoding to linearize the trees. Like the two other parsers, it uses biLSTMs to contextualize sentences, but it does not use any mechanism on top of their outputs (such as biaffine attention or a decoder module) to predict the tree, which is rebuilt from a sequence of labels.
In particular, we use this third model to: (i) estimate how sensitive raw biLSTMs are to attacks, (ii) compare their behavior against the transition- and graph-based models and the extra mechanisms that they incorporate, and (iii) verify whether such mechanisms play a role against perturbed inputs.
Inputs We concatenate a word vector, a second word vector computed at the character level, and (optionally) a morphological vector. This is the preferred input setup of previous work on PoS tags and their utility for neural UD parsing (de Lhoneux et al., 2017; Anderson and Gómez-Rodríguez, 2021). Note that character-level vectors should be robust against our attacks, but it is known that in practice they are fragile (Pruthi et al., 2019). In this respect, our models use character-level dropout to strengthen their behaviour against word variation. This way, we inject noise during training and give all our models a lexical-level defensive mechanism to deal with misspellings. We kept this feature to keep the setup realistic, as character-level dropout is implemented by default in most modern parsers, and to ensure stronger baselines.
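Character-level dropout can be sketched as follows. Implementations differ (some drop the character index at embedding-lookup time); this is only an assumed variant in which, during training, each character is independently replaced by a placeholder symbol with probability `p`.

```python
import random

def char_dropout(word, p=0.05, unk="\u2205", rng=random):
    """Character-level dropout: each character is independently
    replaced by a placeholder symbol with probability p, so the
    character encoder is trained on noisy word forms."""
    return "".join(unk if rng.random() < p else c for c in word)
```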
Training and hyperparameters We use non-perturbed training and development sets, 5 since our aim is to see how parsers trained in a standard way (and that may use explicit morphological features) behave in production under adversarial attacks. Alternatively, we could design additional techniques to protect the parsers against such perturbations, but this is out of the scope of this paper (and as a standard defensive strategy, we already have character-level dropout). For all parsers, we use the default configuration specified in the corresponding repositories. We trained the models on 2 GeForce RTX 3090 GPUs for around 120 hours.

Morphological tags
To predict them, we use a sequence labeling model with the same architecture as the one used for the sequence labeling parser. We use as input the concatenation of a word embedding and a character-level LSTM vector.

Experiments
We now describe our experimental setup:

Data We selected 14 UD treebanks (Zeman et al., 2021) that use the Latin alphabet and are annotated with universal PoS tags (UPOS), language-specific PoS tags (XPOS), and morphological feats (FEATS). It is a diverse sample that considers different language families and amounts of data, whose details are shown in Table 1. For the pre-trained word vectors, we rely on Bojanowski et al. (2017). 6 Also, note that we only perturb the test inputs. Thus, when the input is highly perturbed, the model will mostly depend on the character representations and, if used, the morphological tags fed to it.
Generating perturbed treebanks For each test set, we create several versions with increasing percentages of perturbed content words (from 0% to 100%, in steps of 10 percentage points) to monitor how the magnitude of the attacks affects the results.

5 For the models that use morphological information we went for gold tags for training. The potential advantages of training with predicted PoS tags vanish here, as the error distribution for PoS tags would be different for non-perturbed (during training) versus perturbed inputs (during testing).

6 We exclude experiments with BERT-based models for a few reasons: (i) to be homogeneous with previous setups (e.g. Smith et al. (2018), Anderson et al. (2021)), (ii) because the chosen parsers already obtain competitive results without the need of these models, and (iii) for a better understanding of the results, since it is hard to interpret the performances of individual languages without drawing conclusions biased by the language model used rather than the parsing architecture.
For each targeted word, one of the four proposed perturbations is applied randomly.To control for randomness, each model is tested against 10 perturbed test sets with the same level of perturbation.
To check that the scores were similar across runs, we computed the average scores and the standard deviation (most of them exhibiting low values).
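The generation procedure above could look like the sketch below, assuming tokens come as (form, UPOS) pairs and that some word-level attack function is available; the open-class tag set used to identify content words is our assumption, not a detail from the paper.

```python
import random

# Open-class universal PoS tags, used here as a proxy for "content words".
CONTENT_UPOS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def perturb_sentence(tokens, rate, perturb_word, rng):
    """Perturb `rate` (0.0-1.0) of the content words of a sentence.

    `tokens` is a list of (form, upos) pairs; each chosen word is
    attacked exactly once, and function words are left untouched.
    """
    content_idx = [i for i, (_, upos) in enumerate(tokens) if upos in CONTENT_UPOS]
    n_attacked = round(len(content_idx) * rate)
    attacked = set(rng.sample(content_idx, n_attacked))
    return [(perturb_word(form, rng) if i in attacked else form, upos)
            for i, (form, upos) in enumerate(tokens)]
```

In the paper's setup, this would be run for rates 0.0, 0.1, ..., 1.0, with 10 random test sets generated per rate.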
Setup For each parser we trained four models: a word-only (word) baseline where the input is just the concatenation of a pre-trained word vector and a character-level vector, and three extra models that additionally use universal PoS tags (word+UPOS), language-specific PoS tags (word+XPOS), or feats (word+FEATS). For parsing evaluation, we use labeled attachment scores (LAS). For the taggers, we report accuracy. We evaluate the models in two setups regarding the prediction of morphological tags: (i) tags predicted on the same perturbed inputs as the dependency tree, and (ii) tags predicted on non-perturbed inputs. Specifically, the aim of setup (ii) is to simulate the impact of using a tagger that is very robust against lexical perturbations.
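For reference, LAS reduces to a comparison of (head, dependency label) pairs against the gold annotation; a minimal sketch:

```python
def las(gold, pred):
    """Labeled attachment score: the fraction of tokens whose predicted
    head index *and* dependency label both match the gold annotation.

    `gold` and `pred` are lists of (head, deprel) pairs, one per token.
    """
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)
```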

Results
Tables 2 and 3 show the average LAS results across all treebanks and models for tags predicted on perturbed and non-perturbed inputs, respectively. Figures 1, 2, and 3 display the mean LAS difference between the word and the other model configurations, using tags predicted on both perturbed and non-perturbed inputs for each parser.

Results using morphological tags predicted on perturbed inputs
Figure 1.a shows the score differences for the transition-based parsers. The average difference between the baseline and all the models using morphological tags becomes more negative as the percentage of perturbed words increases. Such difference is only positive for word+XPOS when no or only a few words are perturbed. All morphological tags show a similar tendency, with word+FEATS degrading the performance the most, followed by the coarse-grained word+UPOS.

Table 3: Average LAS scores for all treebanks and degrees of perturbation for the word, word+UPOS, word+XPOS, and word+FEATS models, using morphological tags predicted on non-perturbed input.
Figure 2.a shows the results for the graph-based parsers. Again, most morphological inputs contribute to degrade the performance faster than the baseline. In this case, no model beats the baseline when predicting tags on the perturbed inputs. The performance of word+FEATS and word+UPOS is similar (with word+UPOS performing slightly better), and the word+XPOS models degrade the performance the least. Figure 3.a shows the results for the sequence labeling parsers: the differences between the baseline and the models using morphological information exhibit minor changes from 0% to 100% of perturbed words. Also, the usefulness of the morphological information depends on the specific tags selected. While word+UPOS obtains similar results to the baseline, word+XPOS scores around 2-3 points higher for the tested percentages of perturbations, and word+FEATS harms the performance by between 1 and 4 points.
These results show that feeding morphological tags to both graph- and transition-based parsers has a negative impact under such attacks, degrading their performance faster. On the contrary, the sequence labeling parsers, which rely on bare biLSTMs to make the predictions, can still benefit from them. In addition, the different trends for the sequence labeling parser versus the transition- and graph-based parsers, which additionally include a module to output trees (a pointer network and a biaffine attention module, respectively), suggest that such modules are likely more effective against adversarial attacks than explicit morphological signals.

Results using morphological tags predicted on non-perturbed inputs
As mentioned above, we use this setup to estimate whether morphological tags could have a positive impact if they were extremely robust against lexical perturbations (see also Figures 1.b, 2.b and 3.b). In the case of the transition-based parser, we observe that morphological tags predicted on non-perturbed inputs help the parser more as the inputs' perturbation grows, with word+XPOS being the most helpful information, while UPOS and FEATS become useful only when over 20% of the words are perturbed (and they, too, become more and more helpful). The graph-based parser also benefits from the use of more precise tags: word+XPOS models beat the baseline when the perturbation is over 30%, and over 50% for the word+UPOS and word+FEATS setups. Finally, for the sequence-labeling parser, morphological information from a robust tagger helps the model surpass the baseline for any percentage of perturbed words (except in the case of word+FEATS, where it only happens with perturbations over 20%).

Discussion on slightly perturbed inputs
Unintended typos are common among real-world users. For experiments with a small percentage of perturbed words (<20%), transition-based parsers show improvement solely with the word+XPOS model, even when using non-robust taggers. Conversely, graph-based parsers do not benefit from morphological tags in this setup. Last, sequence labeling parsers benefit from incorporating XPOS and UPOS information, irrespective of the tagger's robustness, but not from FEATS.

Differences across morphological tags
Averaging across languages, the language-specific XPOS tags show a better (or less bad, in setup i) behavior; these tags are tailored to each language. The coarse-grained UPOS tags share a common annotation schema and tagset, which eases annotation and understanding, but offers less valuable information. For FEATS, the annotation schema is also common, but in this case the tags might be too sparse.

Conclusion
This paper explored the utility of morphological information to create stronger dependency parsers when these face adversarial attacks at the character level. Experiments over 14 diverse UD treebanks, with different percentages of perturbed inputs, show that using morphological signals helps create more robust sequence labeling parsers, but contributes to a faster degradation of the performance of transition- and graph-based parsers, in comparison to the corresponding word-only models.

Limitations
Main limitation 1 The experiments of this paper cover only 14 languages that use the Latin alphabet, with a high share of Indo-European languages, including up to 4 Germanic ones. This is due to two reasons: (i) the scarcity of XPOS and FEATS annotations in treebanks from other language families, and (ii) the research team involved in this work did not have access to proficient speakers of languages that use other alphabets. Hence, although we created a reasonably diverse sample of treebanks, it is not representative of all human languages.
Main limitation 2 Although we follow previous work to automatically generate perturbations at the character level, and these are inspired by psycholinguistic studies, they might not be coherent with the type of mistakes that a human would make. In this work, generating human errors is not feasible due to the number of languages involved and the economic cost of such manual labour. Still, we think the proposed perturbations serve the main purpose: to study how morphological tags can help parsers when these face lexical errors, while the method used builds on most previous work on character-level adversarial attacks.

Figure 1: Average ∆LAS across all treebanks for the transition-based models word+upos, word+xpos, and word+feats vs word, using morphological tags predicted on perturbed and non-perturbed inputs.

Figure 2: Average ∆LAS across all treebanks for the graph-based models word+upos, word+xpos, and word+feats vs word, using morphological tags predicted on perturbed and non-perturbed inputs.

Figure 3: Average ∆LAS across all treebanks for the sequence-labeling models word+upos, word+xpos, and word+feats vs word, using morphological tags predicted on perturbed and non-perturbed inputs.

Table 1: Relevant information for the treebanks used.