FSUIE: A Novel Fuzzy Span Mechanism for Universal Information Extraction

Universal Information Extraction (UIE) has been introduced as a unified framework for various Information Extraction (IE) tasks and has achieved widespread success. Despite this, UIE models have limitations. For example, they rely heavily on exact span boundaries in the data during training, which does not reflect the reality of span annotation: slightly adjusted boundary positions can also yield acceptable spans. Additionally, UIE models pay no attention to the limited span length characteristic of IE. To address these deficiencies, we propose the Fuzzy Span Universal Information Extraction (FSUIE) framework. Specifically, our contribution consists of two concepts: fuzzy span loss and fuzzy span attention. Our experimental results on a series of main IE tasks show significant improvement over the baseline, especially in terms of fast convergence and strong performance with small amounts of data and few training epochs. These results demonstrate the effectiveness and generalization of FSUIE across different tasks, settings, and scenarios.


Introduction
Information Extraction (IE) focuses on extracting predefined types of information from unstructured text sources; representative IE tasks include Named Entity Recognition (NER), Relation Extraction (RE), and Sentiment Extraction (SE). To uniformly model the various IE tasks under a unified framework, a generative Universal Information Extraction (UIE) framework was proposed (Lu et al., 2022) and has achieved widespread success on various IE datasets and benchmarks. Because generative UIE requires a powerful generative pre-training model, its time overhead is extensive and its efficiency is not satisfactory. For this reason, this paper examines span-based UIE to unify various IE tasks, conceptualizing IE tasks as predictions of spans.

Figure 1: An example of annotation ambiguity for the "vehicle" entity in the sentence "On the 3rd September evening, I saw a yellow sports car drive past my house".
However, UIE models still have some limitations. First, because training models to extract specific information from unstructured text relies heavily on human annotation, the information to be extracted must be identified and the corresponding span boundaries marked in the text manually. Due to the complexity of natural language, determining the correct span boundaries can be challenging, leading to the phenomenon of annotation ambiguity. As shown in Figure 1, different annotated spans can all be considered reasonable. In the span learning of UIE models, teacher forcing is commonly used for loss calculation, making the model dependent on the precise span boundaries given in the training data. This can cause performance bottlenecks due to annotation ambiguity.
When the model structure in UIE places too much emphasis on the exact boundaries of IE targets, supervision information is insufficiently utilized. When predicting span boundaries, positions closer to the ground truth should be treated as more nearly correct than those farther away, as shown in Figure 1: words close to the target "car" are more likely to be correct than the word "evening", which is farther from the target. Once the span containing "car" has been located, both "yellow car" and "yellow sports car" can be regarded as vehicle entities. This means the span model learned should be fuzzy rather than precise.
In addition, the use of a pre-trained Transformer (Vaswani et al., 2017) in UIE to extract the start and end position representations also poses a problem. The Transformer model is designed to focus on the global representation of the input text, while UIE requires focusing on specific parts of the text to determine the span boundaries. This mismatch between the Transformer's global focus and UIE's focus on specific parts of the text can negatively impact model performance.
When there is a mismatch between the Transformer architecture and span representation learning, the model may not make good use of prior knowledge in IE. Specifically, given the start boundary (or end boundary) of a label span, the corresponding end boundary (or start boundary) is more likely to be found within a certain range before or after it, rather than anywhere in the entire sequence. This is a prior hypothesis that spans have limited length, which is ignored in the vanilla UIE model. To exploit it, a fuzzy span attention mechanism, rather than fixed attention, should be applied.
In this paper, we propose the Fuzzy Span Universal Information Extraction (FSUIE) framework, which addresses the limitations of UIE models by applying the fuzzy span feature, reducing over-reliance on label span boundaries and adaptively adjusting attention span length. Specifically, to address the issue of fuzzy boundaries, we design a fuzzy span loss that quantitatively represents the correctness information distributed over a fuzzy span. At the same time, we introduce fuzzy span attention, which restricts the scope of attention to a fuzzy range and adaptively adjusts the span length according to the encoding. We conduct experiments on the main IE tasks (NER, RE, and ASTE). The results show that FSUIE yields significant improvements over the strong UIE baseline in different settings. Additionally, it achieves new state-of-the-art performance on some NER, RE, and ASTE benchmarks with only a BERT-base architecture, outperforming models with stronger pre-trained language models and complex neural designs. Furthermore, our model shows extremely fast convergence and good generalization in low-resource settings. These experiments demonstrate the effectiveness and generalization of FSUIE across different tasks, settings, and scenarios.

FSUIE
In FSUIE, incorporating fuzzy span into the base UIE model involves two aspects. First, for the spans carrying specific semantic types in the training data, the boundary targets should be learned as fuzzy boundaries to reduce over-reliance on span boundaries; to achieve this, we propose a novel fuzzy span loss. Second, during span representation learning, the attention applied to a span should be dynamic and of limited length, rather than covering the entire sequence; to achieve this, we propose a novel fuzzy span attention.

Fuzzy Span Loss (FSL)
FSL supplements the traditional teacher-forcing loss (usually implemented as cross-entropy) to guide the model in learning fuzzy boundaries. The challenge for FSL is how to quantify the distribution of correctness information within the fuzzy boundary. Specifically, for a given label span S, conventional one-hot target distributions indicate the correct start and end boundaries. This form follows a Dirac delta distribution that focuses only on the ground-truth positions and cannot model the ambiguity phenomenon in boundaries.
To address the challenge discussed above, we propose a fuzzy span distribution generator (FSDG). In our method, we use a probability distribution over span boundaries to represent the ground truth, which describes the uncertainty of boundary localization more comprehensively. It consists of two main steps: 1) determining the probability density function f; 2) mapping from the continuous distribution to a discrete probability distribution based on f. Specifically, let q ∈ S be a boundary of the label span; then the total probability value of its corresponding fuzzy boundary q̂ can be represented as:

q̂ = ∫_{R_min}^{R_max} Q(x) dx,  (1)

where x represents the coordinate of boundaries within the fuzzy range [R_min, R_max], R_min and R_max are the start and end positions of the fuzzy range, q_gt represents the ground-truth position for boundary q, and Q(x) represents the corresponding coordinate probability.
The traditional Dirac delta distribution can be viewed as a special case of Eq. (1), where Q(x) = 1 when x = q_gt and Q(x) = 0 otherwise. Through a mapping function F, we quantify the continuous fuzzy boundary into a unified discrete variable q' = [F(q_1), F(q_2), ..., F(q_n)] with n subintervals, where [q_1, q_2, ..., q_n] are continuous coordinates in the fuzzy range with q_1 = R_min and q_n = R_max. The probability distribution of each given boundary of the label span can then be represented within the range via the softmax function.
Since the Dirac delta distribution assigns non-zero probability to only a single point, it is unsuitable for modeling uncertainty or ambiguity in real-world data. Thus in FSUIE, we choose the Gaussian distribution N(µ, σ²) as the probability density function f. Compared with other probability distributions, the Gaussian distribution assigns non-zero probability to an entire range of values and has the following advantages: (1) it is continuous and symmetrical, and can well represent the distribution of correctness information within the fuzzy boundary, including the gold position; (2) it is a stable distribution with few peaks and offsets, ensuring that the correctness information remains concentrated on the gold position while being distributed over the fuzzy boundary; (3) its integral is 1, which makes the accuracy distribution after softmax smoother.
To obtain the discrete variable q', four parameters are involved: variance σ, mean µ, sampling step s, and sampling threshold θ, which control the spread, peak position, sampling density, and filtering of the fuzzy boundary, respectively. Specifically, µ is set to q_gt and the Gaussian distribution is determined using a pre-determined σ. Assuming q_g ∈ [q_1, q_2, ..., q_n] = q_gt, F can be represented as:

F(q_i) = f(q_i) if f(q_i) ≥ θ, and 0 otherwise,  (2)

where f is the Gaussian density with mean µ and variance σ², sampled at interval s. Given that values in the marginal regions of the Gaussian distribution are quite small, the sampling threshold θ acts as a filter that eliminates information from unimportant locations. The specific choice of parameters is discussed in the experimental section. We use q' as the distribution of correctness information on the fuzzy boundaries; the start and end fuzzy boundaries together make up the fuzzy span. We then calculate the KL divergence between the model's predicted logits and the gold fuzzy span distribution as the fuzzy span loss. The exact and fuzzy boundary distributions are illustrated in Figure 2. The fuzzy span loss is incorporated into the original teacher-forcing loss with a coefficient:

L = L_ori + λ · D_KL(q' ‖ p),  (3)

where p represents the predicted distribution of the model and q' represents the fuzzy span distribution generated by FSDG from the annotations in the training data. L_ori is the original Binary Cross-Entropy (BCE) loss of the model in UIE, and λ is the coefficient of the fuzzy span loss.
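As a concrete illustration, the FSDG and the combined loss can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the mapping of continuous samples onto token positions, the masked softmax, and the reduction over positions are not specified by the paper and are chosen here for illustration.

```python
import numpy as np

def fsdg(q_gt, seq_len, sigma=0.5, step=0.3, theta=0.3):
    """Fuzzy span distribution generator (sketch): sample a Gaussian
    centered on the annotated boundary q_gt at interval `step`, drop
    samples below threshold `theta`, accumulate the rest onto discrete
    token positions, and normalize with a softmax."""
    xs = np.arange(0.0, float(seq_len - 1), step)
    density = np.exp(-0.5 * ((xs - q_gt) / sigma) ** 2)
    keep = density >= theta                      # theta filters marginal values
    q = np.zeros(seq_len)
    for x, d in zip(xs[keep], density[keep]):
        q[int(round(x))] += d                    # map continuous samples to tokens
    mask = q > 0
    e = np.exp(q[mask] - q[mask].max())
    out = np.zeros(seq_len)
    out[mask] = e / e.sum()                      # softmax over surviving positions
    return out

def fuzzy_span_loss(logits, onehot, fuzzy, lam=0.01):
    """L = BCE(exact boundary) + lambda * KL(fuzzy || predicted)."""
    eps = 1e-9
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid for the BCE term
    bce = -np.mean(onehot * np.log(p + eps) + (1 - onehot) * np.log(1 - p + eps))
    e = np.exp(logits - logits.max())
    pd = e / e.sum()                             # softmax over positions for KL
    m = fuzzy > 0
    kl = np.sum(fuzzy[m] * np.log(fuzzy[m] / (pd[m] + eps)))
    return bce + lam * kl
```

With σ = 0.5, step 0.3, and θ = 0.3 (the values used in our experiments), the resulting distribution peaks at the annotated position and spreads a small amount of probability onto its neighbors.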

Fuzzy Span Attention (FSA)
We construct FSA on top of a multi-head self-attention mechanism with relative positional encoding (RPE), since RPE is more suitable for span representation learning with fuzzy bounds. In conventional multi-head attention with RPE, for a token at position t in the sequence, each head computes the similarity between this token and the tokens in the sequence. The similarity between token t and token r can be represented as:

s_{t,r} = (W_q y_t)^T (W_k y_r + p_{t−r}),  (4)

where W_k and W_q are the weight matrices for the "key" and "query" representations, y_t and y_r are the representations of tokens t and r, and p_{t−r} is the relative position embedding; the corresponding attention weight is obtained through a softmax function. Conventional self-attention focuses on global representations, mismatching the requirement of fuzzy spans. To address this issue, we present a novel attention mechanism, called Fuzzy Span Attention (FSA), that controls the attention score of each token, aiming to learn a span-aware representation. The fuzzy span mechanism of FSA consists of two aspects: (1) the length of the range receiving full attention is dynamically adjusted; and (2) the attention weights at the boundary of the full attention span attenuate rather than being truncated. Specifically, inspired by (Sukhbaatar et al., 2019), we design a mask function g_m to control the attention score calculation. Assuming the maximum length of the possible attention span is L_span, the new attention weights can be represented as:

a_{t,r} = g_m(t − r) exp(s_{t,r}) / Σ_{q=t−L_span}^{t} g_m(t − q) exp(s_{t,q}).  (5)
The construction proceeds in two stages: (1) determining the attention attenuation function g_a on the fuzzy span, and (2) constructing the mask function g_m based on g_a for span-aware representation learning. According to the characteristics of fuzzy span, we set g_a as a monotonically decreasing linear function. To adjust the attention span length, we define a learnable parameter δ ∈ [0, 1]. The g_a(x) and corresponding g_m(x) can be represented as follows:

g_a(x) = (l + d − x) / d,
g_m(x) = min(max(g_a(x), 0), 1),  (6)

where l controls the length of the full attention range (determined by δ and L_span) and d is a hyper-parameter that governs the length of the attenuated attention range.
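Under this reading of the mask (with l = δ · L_span, and the clip form being our assumption in the spirit of adaptive-span masking), g_m can be sketched as:

```python
import numpy as np

def g_m(distance, delta, L_span=30, d=32):
    """Fuzzy span attention mask (sketch): full attention up to
    l = delta * L_span, linear attenuation g_a(x) = (l + d - x) / d
    over the next d positions, and zero beyond that."""
    l = delta * L_span
    x = np.asarray(distance, dtype=float)
    # clip g_a to [0, 1]: 1 inside the full-attention range, 0 past the tail
    return np.clip((l + d - x) / d, 0.0, 1.0)
```

A head with a larger learned δ keeps full attention over a longer range, and gradients can flow into δ through the linearly attenuated region.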
Figure 3 depicts the g_m function. The dashed lines represent alternative choices of g_a functions. Through experimentation, we found that the linear attenuated function performs best (refer to the comparison in Appendix A). Iterative optimization of δ allows the model to learn the optimal attention span length for a specific task. Note that different heads learn their attention span lengths independently and thus obtain different optimal fuzzy spans. In our implementation, instead of stacking multiple fuzzy span attention layers, we construct the span-aware representation with a single fuzzy span attention layer on top of the Transformer encoder, and it does not participate in the encoding process. Therefore, although the maximum range of fuzzy span attention is limited by L_span, it only affects span decisions and has no impact on the representation of tokens in the sequence.
Experiments

We trained the models for 50 epochs with a learning rate of 1e-5 on the datasets of each task, and selected the final model based on performance on the development set.

Results on NER tasks
We report the results of the NER task in Table 1. Comparing our baseline UIE-base with other methods, UIE-base achieves results comparable to other methods using the same BERT-base architecture. It therefore serves as a strong baseline against which to demonstrate the enhancements from FSL and FSA. By introducing FSL and FSA, our FSUIE-base achieves significant performance improvements over the UIE-base without the fuzzy span mechanism (+1.15, +1.59, +1.99 F1 scores). Our proposed FSUIE model shows the most significant improvement on the ADE dataset. This is primarily due to the smaller scale of the ADE training set, which allows the model to more easily learn generalized fuzzy span-aware representations. This demonstrates the superiority of the FSUIE model.
FSL and FSA enable the model to reduce over-dependence on label span boundaries and to learn span-aware representations. When compared to existing NER models, FSUIE achieves new state-of-the-art performance on the ADE dataset even with the BERT-base backbone. FSUIE-large achieves a further significant improvement (+1.42) over FSUIE-base. FSUIE-large also achieves comparable results on the ACE04 and ACE05 datasets, even against models using stronger pre-trained language models such as ALBERT-xxlarge. Furthermore, our FSUIE demonstrates a structural advantage over the generative UIE model: as it does not require generating complex IE linearized sequences, our FSUIE-base, which uses only BERT-base as its backbone, outperforms the generative UIE model that uses T5-v1.1-large on the ACE05 dataset.

Results on RE tasks
In Table 2, we present the results of the RE tasks.
Compared to the baseline UIE-base, which does not incorporate the fuzzy span mechanism, our proposed FSUIE-base, which incorporates FSL and FSA, also achieves a significant improvement on the RE task using the same backbone. Furthermore, when compared to the Table-Sequence Encoder approach (Wang and Lu, 2020), our method learns label span boundary distributions and span-aware representations, yielding optimal or competitive results on the RE task even with FSUIE-base, despite its simpler structure and smaller PLM backbone.
Compared to span-based IE models, our method outperforms traditional joint extraction models by performing two-stage span extraction and introducing the fuzzy span mechanism. Specifically, on the ADE dataset, our method performs better than joint extraction methods using BioBERT, a pre-trained language model for the biomedical domain, even though it uses BERT-base as its pre-trained language model. This demonstrates that the fuzzy span mechanism extracts general information from the data, giving the model stronger information extraction capabilities rather than simply fitting the data.
Compared to generative UIE models, our span-based FSUIE reflects the real structure of IE tasks and does not require additional sequence generation structures, achieving higher results with fewer parameters even with FSUIE-base. Compared to models that perform relation extraction with a pipeline approach, such as PL-Marker, our FSUIE improves performance in both stages of the pipeline by introducing FSL and FSA, resulting in an overall improvement in relation extraction. Additionally, our model achieves new state-of-the-art results on the ACE04 and ADE datasets using only BERT-base as the backbone, and on the ACE05 dataset with FSUIE-large, compared to other models with more complex structures. This demonstrates the model's ability to effectively extract information through our proposed method.

Results on ASTE tasks
In Table 3, we present the results of our experiments on the ASTE task. Due to the small scale of ASTE-Data-V2, FSUIE-large is not needed to achieve better results, so this section uses only FSUIE-base for comparison. By introducing the fuzzy span mechanism, our FSUIE model significantly improves ASTE performance compared to the baseline UIE-base, which further demonstrates the effectiveness and generalization ability of FSUIE across IE tasks. Additionally, our FSUIE-base model achieves state-of-the-art results on three datasets (14lap, 15res, 16res) and competitive performance on the 14res dataset. This indicates that the fuzzy span mechanism is effective in improving the model's ability to exploit and extract information, as well as its performance on specific tasks, without increasing model parameters.
Furthermore, our FSUIE model has a relatively simple architecture compared to other models, which shows that FSUIE is able to improve performance.

For ASTE, span-based UIE models, as opposed to generative UIE models, can leverage the complete semantic information of the predicted aspect span to assist in extracting opinions and sentiments. The fuzzy span mechanism enhances the model's ability to exploit the semantic information within the fuzzy span, where possible opinions and sentiments reside, while ensuring span-aware representation learning, resulting in significant improvements. In addition, FSUIE reflects the real structure of the IE task, avoiding the extra parameters that sequence generation structures bring, and therefore outperforms generative UIE models with fewer parameters.
We notice that FSUIE improves relatively less on the RE task than on the ASTE task. In the RE task, the model has to learn different entities, different types of relations, and binary matching. In contrast, in the ASTE task, the model only needs to learn different entities, two relations that differ significantly in semantics (opinion and sentiment), and ternary pairing. From this perspective, RE tasks are more challenging than ASTE tasks.

Results on Low-resource Settings
To demonstrate the robustness of our proposed FSUIE method in low-resource scenarios, we conducted experiments with reduced amounts of training data, on ACE04 for the NER and RE tasks and on 14res for the ASTE task. Specifically, we created three subsets of the original training data at 1%, 5%, and 25% of the original size. In each low-resource experiment, we trained the model for 200 epochs instead of 50. The results, comparing FSUIE-base and UIE-base, are presented in Table 4.
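The subset construction can be sketched as follows. The paper does not specify the sampling procedure; uniform random sampling with a fixed seed is our assumption here.

```python
import random

def make_low_resource_subsets(train_examples, fractions=(0.01, 0.05, 0.25), seed=42):
    """Build reduced training sets for the low-resource experiments:
    one uniformly sampled subset per requested fraction."""
    rng = random.Random(seed)
    return {f: rng.sample(train_examples, max(1, int(len(train_examples) * f)))
            for f in fractions}
```

Fixing the seed keeps the 1%, 5%, and 25% subsets identical across the FSUIE-base and UIE-base runs, so the comparison reflects the model rather than the sample.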
The results of the low-resource experiments further confirm the superior performance of FSUIE over UIE in low-resource scenarios. With only a small fraction of the original training data, FSUIE still achieves competitive or even better performance than UIE. This demonstrates the robustness and generalization ability of FSUIE when dealing with limited data, and validates its ability to extract rich information from small training sets.
We also found that the model performed better on the NER and ASTE tasks than on the RE task under low-resource settings. This is because NER and ASTE are simpler than RE, so less data suffices for good learning performance. Additionally, we noticed a small performance decrease on the ASTE task for the 100% set compared to the 25% set. This may be because the training data is unbalanced, and reducing the training size can alleviate this phenomenon.

Ablation Study
To verify that FSUIE makes more effective use of the information in the training set, we examine the model training process. Specifically, we recorded the performance of the baseline UIE-base, UIE-base+FSL, UIE-base+FSA, and the full FSUIE-base at different training steps on the NER ACE04 test set; the results are shown in Figure 4. The models with FSA converge significantly faster, indicating that by learning span-aware representations, which are closer to the span prediction goal, span learning becomes easier and more efficient. With FSA, the model can focus its attention on the necessary positions and capture the possible span within a given sequence. FSL, in contrast, shows a convergence trend similar to the baseline and thus may not improve convergence speed. To further investigate the contributions of FSL and FSA to model performance, we conducted ablation experiments on the NER task using the ADE dataset; the results are shown in Table 5. Introducing FSL alone improves model performance. Using FSA alone slightly degrades performance. However, when FSL and FSA are used together, the model is significantly enhanced.
From our perspective, introducing FSA alone makes the model focus on specific parts of the sequence rather than the global representation, losing information from text outside the span; this may explain the slight drop in performance with UIE+FSA. It also shows, however, that in IE tasks, sequence information outside a specific span has very limited impact on the results. Introducing FSL alleviates the model's over-dependence on label span boundaries, allowing it to extract more information and improving both settings. When FSA and FSL operate simultaneously, the model extracts richer information from the text and FSA guides it to filter out the more critical parts, yielding the most substantial improvement.

Visualization of FSA
To further examine the effectiveness of the fuzzy span mechanism, we visualized the attention distribution of the FSA layer in FSUIE-large, as shown in Figure 5. Note that FSA is only placed at the top layer to construct the span-aware representation and does not participate in the encoding process; it thus only affects span decisions, not the representation of tokens in the sequence.
The attention distribution indicates that, for a given input text, each token in the final encoding sequence tends to focus on semantic information within a limited range of preceding tokens rather than on the global representation of the input text. This aligns with our design expectation for the fuzzy span mechanism and confirms that it does indeed guide the model toward an appropriate attention distribution for IE tasks.

Related Work
Universal Model Building universal model structures for a wide range of NLP tasks has been a hot research area in recent years. The focus is on building model structures that can be adapted to different sources of data, different types of labels, different languages, and different tasks. Several universal models have been proposed, such as models learning deep contextualized word representations (Peters et al., 2018; Devlin et al., 2019), event extraction models that can predict different labels universally (Lu et al., 2021), models that can handle multiple languages (Arivazhagan et al., 2019; Aharoni et al., 2019; Conneau et al., 2020), a universal fine-tuning approach to transfer learning (Howard and Ruder, 2018), models that learn syntactic dependency structure over many typologically different languages (Li et al., 2018; Sun et al., 2020), and models that universally handle various IE tasks in a unified text-to-structure framework (Lu et al., 2022). This paper builds upon UIE by incorporating the fuzzy span mechanism to improve IE performance.
For more related work on sparse attention, please refer to Appendix B.

Conclusion
In this paper, we proposed the Fuzzy Span Universal Information Extraction (FSUIE) framework, an improvement for Universal Information Extraction.
To make use of boundary information in the training data and learn a span-aware representation closer to the span decision, we proposed a fuzzy span loss and fuzzy span attention. Extensive experiments on several main IE tasks show that FSUIE achieves significant improvements over the UIE baseline, and reaches state-of-the-art results on the ADE NER dataset, the ACE04, ACE05, and ADE RE datasets, and four ASTE datasets. The experiments also reveal FSUIE's fast convergence and good generalization in low-resource settings. All these results demonstrate the effectiveness and generalizability of FSUIE in information extraction.

Limitations
This paper is based on the assumption that Universal Information Extraction (UIE) models have limitations, particularly over-reliance on label span boundaries and inflexible attention span length. The proposed framework may be computationally and spatially expensive, as it requires a more complex attention mechanism and additional computing power for training. Nevertheless, this limitation of the span-based UIE model is minor compared to that of the generative UIE model, which relies on a stronger language model. Additionally, the probability density functions explored in FSL are limited; further research is needed to develop a more targeted strategy for adjusting the correctness information distribution.

Comparison of Different g_a Functions

In Table 6, we present the performance of models using various g_a functions in FSUIE on the ADE NER test set, where g_a^l denotes the linear attenuated function employed in FSUIE. Compared to the UIE-base, which does not integrate the fuzzy span mechanism, all FSUIE-base models employing different g_a functions obtain better results, illustrating the superiority of FSUIE. Among the g_a strategies, FSUIE-base (g_a) shows minimal enhancement. This is likely because an attenuated fuzzy attention span better reflects the real reading context and enables the model to exploit richer information within the boundary of the attention span. The best performance is achieved by FSUIE-base (g_a^l), which indicates that attention should not decay too quickly at the boundary of the attention span, as evidenced by the results of g_a.

Related Work on Sparse Attention
The high time and space complexity of the Transformer (O(n²)) stems from the fact that it must calculate attention between each step and all previous contexts, which makes it difficult for the Transformer to scale in sequence length. To address this issue, sparse attention was proposed (Child et al., 2019). This refers to attention mechanisms that focus on a small subset of the input elements rather than processing the entire input sequence, allowing attention to concentrate on the most contributing factors and thus reducing memory and compute requirements.
Based on the idea of sparse attention, various approaches have been proposed, such as an adaptive width-based attention learning mechanism and a dynamic attention mechanism that allows different heads to learn only their own regions of attention (Sukhbaatar et al., 2019). Zaheer et al. (2020) proposed an O(N) complexity model with three different sparse attentions. Zhuang et al. (2022) sought to make the sparse attention matrix predictable. This paper, however, builds on adaptive span attention (Sukhbaatar et al., 2019) to establish a fuzzy span attention that learns a span-aware representation matching the actual needs of information extraction tasks. Our approach differs from previous work in that we aim to obtain a fuzzy span of attention in the process of locating the target, rather than to reduce computational and memory overhead.

Hyper-parameters L_span and d in FSA

In Table 7, we present the performance of FSUIE-base models with various values of the hyper-parameter d on the ADE NER test set, and in Table 8 with various values of L_span. The results demonstrate that the model's performance is not significantly affected by the choice of these hyper-parameters.

Single-Side and Both-Side Ambiguity in FSL

In practice, there may be cases of single-side ambiguity in the labeling of entity boundaries. We therefore report the performance of FSUIE-base models with different FSL strategies in Table 9, where "single-side" means applying FSL only to the start boundary and "both-side" means applying FSL to both the start and end boundaries. The results suggest that the influence of single-sided versus both-sided fuzziness on performance is limited, because not all head words are at the start or end, and FSL only performs limited left/right extrapolation on precise boundaries without affecting the important information provided by the original boundary. For generality, we use both-side fuzzy spans in FSUIE.

Figure 2 :
Figure 2: Illustration of exact boundary and fuzzy boundary.

Figure 3 :
Figure 3: Illustration of attention mask function g m .

Figure 4 :
Figure 4: NER performance of different models on ACE04 test set.

Figure 5 :
Figure 5: Illustration of the attention score distribution in the FSA layer. The extraction targets are "walter rodgers" and "who".

Table 1 :
NER experimental results on ACE04, ACE05, and ADE datasets. During training, we set the parameters of the Gaussian distribution in FSL as σ = 0.5, the distribution value truncation threshold θ to 0.3, the sampling step s to 0.3, and the loss coefficient λ to 0.01. The parameter µ is set to the coordinate of the annotation boundary. The hyper-parameters L_span and d are determined based on the statistics of target lengths in the UIE training data: we set L_span to 30 and d to 32, and experiments show that the model's performance is not significantly sensitive to the choice of these hyper-parameters (refer to the comparison in Appendix C).
Datasets We conduct experiments on ACE04, ACE05, and ADE (Gurulingappa et al., 2012) for the NER and RE tasks, and on ASTE-Data-V2 (Xu et al., 2020) for the ASTE task. We evaluate our model using different metrics for the three IE tasks. For NER, we use the Entity F1 score, in which an entity prediction is correct if its span and type match a reference entity. For RE, we use the Relation Strict F1 score, where a relation is considered correct only if its relation type and related entity spans are all correct. For ASTE, we use the Sentiment Triplet F1 score, where a triplet is considered correct if the aspect, opinion, and sentiment polarity are all correctly identified. Training Details We trained two variations of FSUIE, FSUIE-base and FSUIE-large, based on the BERT-base and BERT-large model architectures and pre-training parameters respectively. We also trained a UIE-base based on BERT-base as a baseline without the FSL and FSA layers. In FSUIE, we added the FSA layer and the span boundary prediction layer to both models. Specifically, FSUIE-base has 12 Transformer layers with 12 heads and a hidden size of 768, while FSUIE-large has 24 Transformer layers with 16 heads and a hidden size of 1024.
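The Entity F1 metric described above can be sketched as exact set matching over (start, end, type) triples; this is a sketch of the standard computation, and the Relation Strict and Sentiment Triplet F1 scores follow the same pattern with larger tuples.

```python
def entity_f1(pred, gold):
    """Span-level Entity F1: a prediction counts as correct only if its
    (start, end, type) triple exactly matches a gold entity."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)                       # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```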

Table 4 :
Experimental results on low-resource settings.

Table 5 :
Ablation study of FSL and FSA on NER task using ADE dataset.

Table 6 :
Performance of models using different g a in FSA

Table 7 :
Performance of FSUIE-base models with different d

Table 8 :
Performance of FSUIE-base models with different L span

Table 9 :
Performance of FSUIE-base models with different FSL strategies