SemEval-2021 Task 1: Lexical Complexity Prediction

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.


Introduction
The occurrence of an unknown word in a sentence can adversely affect its comprehension by readers. Either they give up, misinterpret, or plough on without understanding. A committed reader may take the time to look up a word and expand their vocabulary, but even in this case they must leave the text, undermining their concentration. The natural language processing solution is to identify candidate words in a text that may be too difficult for a reader (Shardlow, 2013; Paetzold and Specia, 2016a). Each potential word is assigned a judgment by a system to determine whether it is deemed 'complex' or not. These judgments indicate which words are likely to cause problems for a reader. The words that are identified as problematic can be the subject of numerous types of intervention, such as direct replacement in the setting of lexical simplification (Gooding and Kochmar, 2019), or extra information being given in the context of explanation generation.
Whereas previous solutions to this task have typically considered the Complex Word Identification (CWI) task (Paetzold and Specia, 2016a; Yimam et al., 2018), in which a binary judgment of a word's complexity is given (i.e., is a word complex or not?), we instead focus on the Lexical Complexity Prediction (LCP) task (Shardlow et al., 2020), in which a value on a continuous scale is assigned to identify a word's complexity (i.e., how complex is this word?). We ask multiple annotators to give a judgment on each instance in our corpus and take the average prediction as our complexity label. The former task (CWI) forces each annotator to make a subjective judgment about the nature of the word that models their personal vocabulary. Many factors may affect the annotator's judgment, including their education level, first language, specialism or familiarity with the text at hand. The annotators may also disagree on the level of difficulty at which to label a word as complex. One annotator may label every word they feel is above average difficulty, another may label words that they feel unfamiliar with, but understand from the context, whereas another annotator may only label those words that they find totally incomprehensible, even in context. Our introduction of the LCP task seeks to address this annotator confusion by giving annotators a Likert scale on which to provide their judgments. Whilst annotators must still give a subjective judgment depending on their own understanding, familiarity and vocabulary, they do so in a way that better captures the meaning behind each judgment they have given. By aggregating these judgments we have developed a dataset that contains continuous labels in the range 0-1 for each instance. This means that rather than predicting whether a word is complex or not (0 or 1), a system must now predict where, on our continuous scale, a word falls (0-1).
Consider the following sentence taken from a biomedical source, where the target word 'observation' has been highlighted: (1) The observation of unequal expression leads to a number of questions.
In the binary annotation setting of CWI some annotators may rightly consider this term non-complex, whereas others may rightly consider it to be complex. Whilst the meaning of the word is reasonably clear to someone with scientific training, the context in which it is used is unfamiliar for a lay reader and will likely lead to them considering it complex. In our new LCP setting, we are able to ask annotators to mark the word on a scale from very easy to very difficult. Each user can give their subjective interpretation on this scale indicating how difficult they found the word. Whilst annotators will inevitably disagree (some finding it more or less difficult), this is captured and quantified as part of our annotations, with a word of this type likely to lead to a medium complexity value.
LCP is useful as part of the wider task of lexical simplification (Devlin and Tait, 1998), where it can be used to both identify candidate words for simplification (Shardlow, 2013) and rank potential words as replacements. LCP is also relevant to the field of readability assessment, where knowing the proportion of complex words in a text helps to identify the overall complexity of the text (Dale and Chall, 1948). This paper presents SemEval-2021 Task 1: Lexical Complexity Prediction. In this task we developed a new dataset for complexity prediction based on the previously published CompLex dataset. Our dataset covers 10,800 instances spanning 3 genres and containing unigrams and bigrams as targets for complexity prediction. We solicited participants in our task and released a trial, training and test split in accordance with the SemEval schedule. We accepted submissions in two separate Sub-tasks, the first covering single words only and the second covering single words and multi-word expressions (modelled by our bigrams). In total 55 teams participated across the two Sub-tasks.
The rest of this paper is structured as follows: In Section 2 we discuss the previous two iterations of the CWI task. In Section 3, we present the CompLex 2.0 dataset that we have used for our task, including the methodology we used to produce trial, test and training splits. In Section 5, we show the results of the participating systems and compare the features that were used by each system. We finally discuss the nature of LCP in Section 7 and give concluding remarks in Section 8.


Related Tasks

CWI 2016 at SemEval
The first CWI shared task was organized at SemEval 2016 (Paetzold and Specia, 2016a). The CWI 2016 organizers introduced a new CWI dataset and reported the results of 42 CWI systems developed by 21 teams. Words in their dataset were considered complex if they were difficult to understand for non-native English speakers according to a binary labelling protocol. A word was considered complex if at least one of the annotators found it to be difficult. The training dataset consisted of 2,237 instances, each labelled by 20 annotators, and the test dataset had 88,221 instances, each labelled by 1 annotator (Paetzold and Specia, 2016a).
A post-competition analysis (Zampieri et al., 2017) with oracle and ensemble methods showed that most systems performed poorly, due mostly to the way in which the data was annotated and the small size of the training dataset.
CWI 2018 at BEA
The second CWI shared task was organized at the BEA workshop in 2018 (Yimam et al., 2018). Unlike the first task, this second task had two objectives. The first objective was the binary classification of target words as complex or non-complex. The second objective was regression or probabilistic classification, in which 13 teams were asked to assign the probability of a target word being considered complex by a set of language learners. A major difference in this second task was that datasets of differing genres (TEXT GENRES) were provided, as well as English, German and Spanish datasets for monolingual speakers and a French dataset for multilingual speakers (Yimam et al., 2018).

Data
We previously reported on the annotation of the CompLex dataset (Shardlow et al., 2020) (hereafter referred to as CompLex 1.0), in which we annotated around 10,000 instances for lexical complexity using the Figure Eight platform. The instances spanned three genres: Europarl, taken from the proceedings of the European Parliament (Koehn, 2005); The Bible, taken from an electronic distribution of the World English Bible translation (Christodouloupoulos and Steedman, 2015) and Biomedical literature, taken from the CRAFT corpus (Bada et al., 2012). We limited our annotations to focus only on nouns and multi-word expressions following a Noun-Noun or Adjective-Noun pattern, using the POS tagger from Stanford CoreNLP (Manning et al., 2014) to identify these patterns.
Whilst these annotations allowed us to report on the dataset and to show some trends, the overall quality of the annotations we received was poor and we ended up discarding a large number of the annotations. For CompLex 1.0 we retained only instances with four or more annotations and the low number of annotations (average number of annotators = 7) led to the overall dataset being less reliable than initially expected For the Shared Task we chose to boost the number of annotations on the same data as used for CompLex 1.0 using Amazon's Mechanical Turk platform. We requested a further 10 annotations on each data instance bringing up the average number of annotators per instance. Annotators were presented with the same task layout as in the annotation of CompLex 1.0 and we defined the Likert Scale points as previously: Very Easy: Words which were very familiar to an annotator.
Easy: Words with which an annotator was aware of the meaning.
Neutral: A word which was neither difficult nor easy.
Difficult: Words which an annotator was unclear of the meaning, but may have been able to infer the meaning from the sentence.
Very Difficult: Words that an annotator had never seen before, or were very unclear.
These annotations were aggregated with the retained annotations of CompLex 1.0 to give our new dataset, CompLex 2.0, covering 10,800 instances across single words and MWEs and across 3 genres. The features that make our corpus distinct from other corpora which focus on the CWI and LCP tasks are described below:

Continuous Annotations: We have annotated our data using a 5-point Likert scale. Each instance has been annotated multiple times and we have taken the mean of these annotations as the label for each data instance. To calculate this average we converted the Likert scale points to a continuous scale as follows: Very Easy → 0, Easy → 0.25, Neutral → 0.5, Difficult → 0.75, Very Difficult → 1.0.
Contextual Annotations: Each instance in the corpus is presented with its enclosing sentence as context. This ensures that the sense of a word can be identified when assigning it a complexity value. Whereas previous work has reannotated the data from the CWI-2018 shared task with word senses (Strohmaier et al., 2020), we do not make explicit sense distinctions between our tokens, instead leaving this task up to participants.
Repeated Token Instances: We provide more than one context for each token (up to a maximum of five contexts per genre). These words were annotated separately during annotation, with the expectation that tokens in different contexts would receive differing complexity values. This deliberately penalises systems that do not take the context of a word into account.
Multi-word Expressions: In our corpus we have provided 1,800 instances of multi-word expressions (split across our 3 sub-corpora). Each MWE is modelled as a Noun-Noun or Adjective-Noun pattern followed by any POS tag which is not a noun. This avoids selecting the first portion of complex noun phrases.
There is no guarantee that these will correspond to true MWEs that take on a meaning beyond the sum of their parts, and further investigation into the types of MWEs present in the corpus would be informative.
Aggregated Annotations: By aggregating the Likert scale labels we have generated crowdsourced complexity labels for each instance in our corpus. We are assuming that, although there is inevitably some noise in any large annotation project (and especially so in crowdsourcing), this will even out in the averaging process to give a mean value reflecting the appropriate complexity for each instance. By taking the mean average we are assuming unimodal distributions in our annotations.
Varied Genres: We have selected for diverse genres as mentioned above. Previous CWI datasets have focused on informal text such as Wikipedia and multi-genre text, such as news. By focusing on specific texts we force systems to learn generalised complexity annotations that are appropriate in a cross-genre setting.
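The aggregation described under Continuous Annotations above can be sketched in a few lines of Python; the example judgments below are invented for illustration.

```python
# Map each Likert point to the continuous scale used in CompLex 2.0,
# then take the mean over annotators as the instance's complexity label.
LIKERT_TO_SCORE = {
    "very easy": 0.0,
    "easy": 0.25,
    "neutral": 0.5,
    "difficult": 0.75,
    "very difficult": 1.0,
}

def complexity_label(annotations):
    """Aggregate a list of Likert judgments into one continuous label."""
    scores = [LIKERT_TO_SCORE[a] for a in annotations]
    return sum(scores) / len(scores)

# e.g. ten hypothetical annotators judging one instance
judgments = ["easy", "neutral", "neutral", "difficult", "easy",
             "neutral", "easy", "difficult", "neutral", "easy"]
print(complexity_label(judgments))  # 0.45
```

Note that taking the mean in this way assumes a roughly unimodal distribution of judgments, as discussed under Aggregated Annotations.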
We have presented summary statistics for CompLex 2.0 in Table 1. In total, 5,617 unique words are split across 10,800 contexts, with an average complexity across our entire dataset of 0.321. Each genre has 3,600 contexts, split between 3,000 single words and 600 multi-word expressions. Whereas single words are slightly below the average complexity of the dataset at 0.302, multi-word expressions are much more complex at 0.419, indicating that annotators found these more difficult to understand. Similarly, Europarl and the Bible were less complex than the corpus average, whereas the Biomedical articles were more complex. The number of unique tokens varies from one genre to another, as tokens were selected at random and discarded if there were already more than 5 occurrences of the given token in the dataset. This stochastic selection process led to a varied dataset, with some tokens having only one context whereas others have as many as five in a given genre. On average each token has around 2 contexts.

Data Splits
In order to run the shared task we partitioned our dataset into Trial, Train and Test splits and distributed these according to the SemEval schedule. A criticism of previous CWI shared tasks is that the training data did not accurately reflect the distribution of instances in the testing data. We sought to avoid this by stratifying our selection process on a number of factors. The first factor we considered was genre: we ensured that an even number of instances from each genre was present in each split. We also stratified for complexity, ensuring that each split had a similar distribution of complexities. Finally, we stratified the splits by token, ensuring that multiple instances containing the same token occurred in only one split. This last criterion ensures that systems do not overfit to the test data by learning the complexities of specific tokens in the training data. Performing a robust stratification of a dataset according to multiple features is a non-trivial optimisation problem. We solved this by first grouping all instances in a genre by token and sorting these groups by the complexity of the least complex instance in the group. For each genre, we passed through this sorted list and, for each set of 20 groups, put the first group in the trial set, the next two groups in the test set and the remaining 17 groups in the training data. This gave us a rough 5-85-10 split between trial, training and test data. The trial and training data were released in this ordered format; however, to prevent systems from guessing the labels based on the data ordering, we randomised the order of the instances in the test data prior to release. The splits that we used for the Shared Task are available via GitHub¹.
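The grouping-and-dealing procedure above can be sketched as follows; the data structure (a list of dicts with 'token' and 'complexity' keys) is an illustrative assumption, not the task's actual code.

```python
# Group instances by token, sort groups by their minimum complexity, then
# deal each run of 20 groups out as 1 trial / 2 test / 17 train.
from collections import defaultdict

def stratified_split(instances):
    """instances: list of dicts with 'token' and 'complexity' keys."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["token"]].append(inst)
    # Sort groups by the complexity of their least complex instance.
    ordered = sorted(groups.values(),
                     key=lambda g: min(i["complexity"] for i in g))
    trial, test, train = [], [], []
    for i, group in enumerate(ordered):
        pos = i % 20
        if pos == 0:
            trial.extend(group)
        elif pos in (1, 2):
            test.extend(group)
        else:
            train.extend(group)
    return trial, train, test
```

Because whole token groups are dealt to a single split, no token can appear in more than one split, which is the property that prevents systems from memorising token-level complexities.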

Results
The full results of our task can be seen in Appendix A. We had 55 teams participate in our 2 Sub-tasks, with 19 participating in Sub-task 1 only, 1 participating in Sub-task 2 only and 36 participating in both Sub-tasks. We have used Pearson's correlation for our final ranking of participants, but we have also included other metrics that are appropriate for evaluating continuous and ranked data and provided secondary rankings for these. Sub-task 1 asked participants to assign complexity values to each of the single-word instances in our corpus. For Sub-task 2, we asked participants to submit results on both single words and MWEs. We did not rank participants on MWE-only submissions due to the relatively small number of MWEs in our corpus (184 in the test set).
The metrics we chose for ranking were as follows: Pearson's Correlation: We chose this metric as our primary method of ranking as it is well known and understood, especially in the context of evaluating systems with continuous outputs. Pearson's correlation is robust to changes in scale and measures the degree to which the predicted and gold values vary together.
Spearman's Rank: This metric does not consider the values output by a system, or in the test labels, only the order of those labels. It was chosen as a secondary metric as it is more robust to outliers than Pearson's correlation.
Mean Absolute Error (MAE): Typically used for the evaluation of regression tasks, we included MAE as it gives an indication of how close the predicted labels were to the gold labels for our task.

Mean Squared Error (MSE): MSE differs from MAE only in squaring the errors before averaging, which penalises larger deviations more heavily; we include it for completeness.
R2: This measures the proportion of variance of the original labels captured by the predicted labels. It is possible to do well on all the other metrics, yet do poorly on R2 if a system produces annotations with a different distribution than those in the original labels.
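To make the definitions above concrete, here are minimal pure-Python versions of the five metrics (a real evaluation would use scipy.stats and sklearn.metrics; the tie-free Spearman here is a simplification):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(gold, pred):
    # Covariance of the two series divided by the product of their norms.
    mg, mp = mean(gold), mean(pred)
    cov = sum((g - mg) * (p - mp) for g, p in zip(gold, pred))
    norm_g = math.sqrt(sum((g - mg) ** 2 for g in gold))
    norm_p = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (norm_g * norm_p)

def spearman(gold, pred):
    # Pearson's correlation over the ranks (no tie handling, for brevity).
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(gold), ranks(pred))

def mae(gold, pred):
    return mean([abs(g - p) for g, p in zip(gold, pred)])

def mse(gold, pred):
    return mean([(g - p) ** 2 for g, p in zip(gold, pred)])

def r2(gold, pred):
    # 1 minus the ratio of residual variance to total variance of the gold labels.
    mg = mean(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, pred))
    ss_tot = sum((g - mg) ** 2 for g in gold)
    return 1 - ss_res / ss_tot
```

Pearson and Spearman are invariant to linear rescaling of the predictions, whereas MAE, MSE and R2 are not; this is why a system can rank well on correlation yet poorly on R2.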
In Table 3 we show the scores of the top 10 systems across our 2 Sub-tasks according to Pearson's Correlation. We have only reported Pearson's correlation and R2 in these tables, but the full results with all metrics are available in Appendix A.
We have included a Frequency Baseline produced using log-frequency from the Google Web1T and linear regression, which was beaten by the majority of our systems. From these results we can see that systems were able to attain reasonably high scores on our dataset, with the winning systems reporting a Pearson's Correlation of 0.7886 for Sub-task 1 and 0.8612 for Sub-task 2, as well as high R2 scores of 0.6210 for Sub-task 1 and 0.7389 for Sub-task 2. The rankings remained stable across Spearman's rank, MAE and MSE, with some small variations. Scores were generally higher on Sub-task 2 than on Sub-task 1, likely because of the different groups of token types (single words and MWEs): MWEs are known to be more complex than single words, and this fact may have implicitly helped systems to better model the variance of complexities between the two groups.
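The frequency baseline can be approximated as follows; the counts and labels here are toy values, standing in for the Google Web1T counts and gold complexities used in the actual baseline.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# toy training data: (word frequency count, gold complexity)
counts = [1_000_000, 200_000, 5_000, 300]
labels = [0.10, 0.20, 0.45, 0.70]
a, b = fit_line([math.log(c) for c in counts], labels)

def predict(count):
    """Predicted complexity for a word with the given corpus count."""
    return a * math.log(count) + b
```

Since frequent words tend to be easy, the fitted slope is negative: predicted complexity falls as log frequency rises.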

Participating Systems
In this section we have analysed the participating systems in our task. System Description papers were submitted by 32 teams. In the subsections below, we have first given brief summaries of some of the top systems according to Pearson's correlation for each task for which we had a description. We then discuss the features used across different systems, as well as the approaches to the task that different teams chose to take. We have prepared a comprehensive table comparing the features and approaches of all systems for which we have the relevant information in Appendix B.

System Summaries
DeepBlueAI: This system attained the highest Pearson's Correlation on Sub-task 2 and the second highest Pearson's Correlation on Sub-task 1. It also attained the highest R2 score across both tasks.
The system used an ensemble of pre-trained language models fine-tuned for the task with Pseudo Labelling, Data Augmentation, Stacked Training Models and Multi-Sample Dropout. The data was encoded for the transformer models using the genre and token as a query string and the given context as a supplementary input.
JUST BLUE: This system attained the highest Pearson's Correlation for Sub-task 1. The system did not participate in Sub-task 2. This system makes use of an ensemble of BERT and RoBERTa. Separate models are fine-tuned for context and token prediction and these are weighted 20-80 respectively. The average of the BERT models and RoBERTa models is taken to give a final score.
RG PA: This system attained the second highest Pearson's Correlation for Sub-task 2. The system uses a fine-tuned RoBERTa model and boosts the training data for the second task by identifying similar examples from the single-word portion of the dataset to train the multi-word classifier. They use an ensemble of RoBERTa models in their final classification, averaging the outputs to enhance performance.
Alejandro Mosquera: This system attained the third highest Pearson's Correlation for Sub-task 1.
The system used a feature-based approach, incorporating length, frequency, semantic features from WordNet and sentence level readability features. These were passed through a Gradient Boosted Regression.
Andi: This system attained the fourth highest Pearson's Correlation for Sub-task 1. They combine a traditional feature-based approach with features from pre-trained language models. They use psycholinguistic features, as well as GloVe and Word2Vec embeddings. They also take features from an ensemble of language models: BERT, RoBERTa, ELECTRA, ALBERT and DeBERTa. All features are passed through Gradient Boosted Regression to give the final output score.
CS-UM6P: This system attained the fifth highest Pearson's Correlation for Sub-task 1 and the seventh highest Pearson's Correlation for Sub-task 2.
The system uses BERT and RoBERTa and encodes the context and token for the language models to learn from. Interestingly, whilst this system scored highly for Pearson's correlation the R2 metric is much lower on both Sub-tasks. This may indicate the presence of significant outliers in the system's output.
OCHADAI-KYOTO: This system attained the seventh highest Pearson's Correlation on Sub-task 1 and the eighth highest Pearson's Correlation on Sub-task 2. The system used fine-tuned BERT and RoBERTa models with the token and context encoded. They employed multiple training strategies to boost performance.

Approaches
There are three main types of systems that were submitted to our task. In line with the state of the art in modern NLP, these can be categorised as: feature-based systems, deep learning systems, and systems which use a hybrid of the former two approaches. Although deep learning systems attained the highest Pearson's Correlation on both Sub-tasks, occupying the first two places in each task, feature-based systems are not far behind, attaining the third and fourth spots on Sub-task 1 with scores similar to the top systems.
We have described each approach as applied to our task below.

Feature-based systems use a variety of features known to be useful for lexical complexity. In particular, lexical frequency and word length feature heavily, with many different ways of calculating these metrics, such as consulting various corpora and counting syllables or morphemes. Psycholinguistic features, which model people's perception of words, are understandably popular for this task, as complexity is a perceived phenomenon. Semantic features taken from WordNet, modelling the sense of the word and its ambiguity or abstractness, have been used widely, as well as sentence-level features aiming to model the context around the target words. Some systems chose to identify named entities, as these may be innately more difficult for a reader. Word inclusion lists were also a popular feature, denoting whether a word was found on a given list of easy-to-read vocabulary. Finally, word embeddings are a popular feature, coming from static resources such as GloVe or Word2Vec, but also being derived through the use of Transformer models such as BERT, RoBERTa, XLNet or GPT-2, which provide context-dependent embeddings suitable for our task.
These features are passed through a regression system, with Gradient Boosted Regression and Random Forest Regression being two popular approaches amongst participants in this task. Both are invariant to feature scale, meaning that less preprocessing of the inputs is necessary.
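A hypothetical example of such a feature extractor is sketched below; the specific features, the frequency table and the easy-word list are illustrative assumptions rather than any participating team's implementation.

```python
import math

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    vowels = "aeiouy"
    groups, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            groups += 1
        prev = is_vowel
    return max(groups, 1)

def extract_features(token, context, freq_table, easy_words):
    """Build a small feature vector for one target word in context."""
    return [
        len(token),                               # character length
        count_syllables(token),                   # syllable estimate
        math.log(freq_table.get(token, 1)),       # log corpus frequency
        len(context.split()),                     # sentence length
        1.0 if token in easy_words else 0.0,      # easy-vocabulary flag
    ]
```

Vectors like these would then be fed to a regressor such as Gradient Boosted Regression or Random Forest Regression.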
Deep Learning Based systems invariably rely on a pre-trained language model and fine-tune it using transfer learning to attain strong scores on the task. BERT and RoBERTa were used widely in our task, with some participants also opting for ALBERT, ERNIE, or other such language models. To prepare data for these language models, most participants following this approach concatenated the token with the context, separated by a special token ([SEP]). The language model was then trained and the embedding of the [CLS] token extracted and passed through a further fine-tuned network for complexity prediction. Adaptations to this methodology include applying training strategies such as adversarial training, multi-task learning, dummy annotation generation and capsule networks.
Finally, hybrid approaches use a mixture of Deep Learning by fine-tuning a neural network alongside feature-based approaches. The features may be concatenated to the input embeddings, or may be concatenated at the output prior to further training. Whilst this strategy appears to be the best of both worlds, uniting linguistic knowledge with the power of pre-trained language models, the hybrid systems do not tend to perform as well as either feature based or deep learning systems.

MWEs
For Sub-task 2 we asked participants to submit both predictions for single words and multi-words from our corpus. We hoped this would encourage participants to consider models that adapted single word lexical complexity to multi-word lexical complexity. We observed a number of strategies that participants employed to create the annotations for this secondary portion of our data.
For systems that employed a deep learning approach, it was relatively simple to incorporate MWEs as part of the training procedure. These systems encoded the input as a query and context, separated by a [SEP] token. The number of tokens prior to the [SEP] token did not matter, so either one or two tokens could be placed there to handle single-word and multi-word instances simultaneously.
However, feature based systems could not employ this trick and needed to devise more imaginative strategies for handling MWEs. Some systems handled them by averaging the features of both tokens in the MWE, or by predicting scores for each token and then averaging these scores. Other systems doubled their feature space for MWEs and trained a new model which took the features of both words into account.
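The three feature-based MWE strategies described above can be sketched as follows, using hypothetical per-token feature vectors:

```python
def average_features(feats_a, feats_b):
    """Strategy 1: average the two tokens' feature vectors."""
    return [(a + b) / 2 for a, b in zip(feats_a, feats_b)]

def average_predictions(model, feats_a, feats_b):
    """Strategy 2: predict a score per token, then average the scores."""
    return (model(feats_a) + model(feats_b)) / 2

def concatenate_features(feats_a, feats_b):
    """Strategy 3: double the feature space and train a dedicated MWE model."""
    return feats_a + feats_b
```

The third strategy lets a model learn interactions between the two tokens, at the cost of training a separate regressor for MWEs.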

Discussion
In this paper we have posited the new task of Lexical Complexity Prediction. This builds on previous work on Complex Word Identification, specifically by providing annotations which are continuous, rather than binary or probabilistic as in previous tasks. Additionally, we provided a dataset with annotations in context, covering three diverse genres and incorporating MWEs as well as single tokens. We have moved towards this task, rather than rerunning another CWI task, as the outputs of the models are more useful for a diverse range of follow-on tasks. For example, whereas CWI is particularly useful as a preprocessing step for lexical simplification (identifying which words should be transformed), LCP may also be useful for readability assessment or as a rich feature in other downstream NLP tasks. A continuous annotation allows a ranking to be given over words, rather than binary categories, meaning that we can not only tell whether a word is likely to be difficult for a reader, but also how difficult that word is likely to be. If a system requires binary complexity (as in the case of lexical simplification), it is easy to transform our continuous complexity values into a binary value by placing a threshold on the complexity scale. The value of the threshold will likely depend on the target audience, with more competent speakers requiring a higher threshold. When selecting a threshold, the categories we used for annotation should be taken into account; for example, a threshold of 0.5 would capture all words that were rated as neutral or above.
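The thresholding described above is a one-line transformation; the threshold of 0.5 follows the example of marking words rated neutral or above as complex, and would be tuned per audience in practice.

```python
def binarise(complexity, threshold=0.5):
    """Collapse a continuous LCP value to a binary CWI-style judgment."""
    return 1 if complexity >= threshold else 0

print([binarise(c) for c in [0.12, 0.48, 0.50, 0.77]])  # [0, 0, 1, 1]
```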
To create our annotated dataset, we employed crowdsourcing with a Likert scale and aggregated the categorical judgments on this scale to give a continuous annotation. It should be noted that this is not the same as giving a truly continuous judgment (i.e., asking each annotator to give a value between 0 and 1). We selected this protocol as the Likert Scale is familiar to annotators and allows them to select according to defined points (we provided the definitions given earlier at annotation time). The annotation points that we gave were not intended to give an even distribution of annotations and it was our expectation that most words would be familiar to some degree, falling in the very easy or easy categories. We pre-selected for harder words to ensure that there were also words in the difficult and very difficult categories. As such, the corpus we have presented is not designed to be representative of the distribution of words across the English language. To create such a corpus, one would need to annotate all words according to our scale with no filtering. The general distribution of annotations in our corpus is towards the easier end of the Likert scale.
A criticism of the approach we have employed is that it allows for subjectivity in the annotation process. Certainly one annotator's perception of complexity will be different to another's. Giving fixed values of complexity for every word will not reflect the specific difficulties that one reader, or one reader group, will face. The annotations we have provided are averaged values of the annotations given by our annotators; we chose to keep all instances, rather than filtering out those where annotators gave a wide spread of complexity annotations. Further work here may give interesting insights into the nature of subjectivity in annotations. For example, some words may be rated as easy or difficult by all annotators, whereas others may receive both easy and difficult annotations, indicating that the perceived complexity of the instance is more subjective. We did not make the individual annotations available as part of the shared task data, to encourage systems to focus primarily on the prediction of complexity.
An issue with the previous shared tasks is that scores were typically low and that systems tended to struggle to beat reasonable baselines, such as those based on lexical frequency. We were pleased to see that systems participating in our task returned scores indicating that they had learnt to model the problem well (Pearson's Correlation of 0.7886 on Task 1 and 0.8612 on Task 2). The higher score on Task 2 may be explained by the MWEs: these are typically more complex than single words and may have exhibited a lower variance, making them easier for systems to predict. The strong Pearson's Correlation is backed up by a high R2 score (0.6172 for Task 1 and 0.7389 for Task 2), which indicates that the variance in the data is captured accurately by the models' predictions. These models strongly outperformed a reasonable baseline based on word frequency, as shown in Table 3.
Whilst we have chosen in this report to rank systems by Pearson's correlation, giving a final ranking over all systems, it should be noted that there is very little variation in score among the top systems. For Task 1, only 0.0182 points of Pearson's Correlation separate the systems at ranks 1 and 10. For Task 2, a similar difference of 0.021 points separates the systems at ranks 1 and 10. These are small differences, and it may be the case that a different random split of our dataset would have led to a different ordering of the results (Gorman and Bedrick, 2019; Søgaard et al., 2020). This is not unique to our task and is something for the SemEval community to ruminate on as the focus of NLP tasks continues to move towards better evaluation rather than better systems.
An analysis of the systems that participated in our task showed that there was little variation between Deep Learning approaches and Feature Based approaches, although Deep Learning approaches ultimately attained the highest scores on our data. Generally the Deep Learning and Feature Based approaches are interleaved in our results table, showing that both approaches are still relevant for LCP. One factor that did appear to affect system output was the inclusion of context, whether that was in a deep learning setting or a feature based setting. Systems which reported using no context appeared to perform worse in the overall rankings. Another feature that may have helped performance is the inclusion of previous CWI datasets (Yimam et al., 2017;Maddela and Xu, 2018). We were aware of these when developing the corpus and attempted to make our data sufficiently distinct in style to prevent direct reuse of these resources. A limitation of our task is that it focuses solely on LCP for the English Language. Previous CWI shared tasks (Yimam et al., 2018) and simplification efforts (Saggion et al., 2015;Aluísio and Gasperin, 2010) have focused on languages other than English and we hope to extend this task in the future to other languages.

Conclusion
We have presented the SemEval-2021 Task 1 on Lexical Complexity Prediction. We developed a new dataset focusing on continuous annotations in context across three genres. We solicited participants via SemEval and 55 teams submitted results across our two Sub-tasks. We have shown the results of these systems and discussed the factors that helped systems to perform well. We have analysed all the systems that participated and categorised their findings to help future researchers understand which approaches are suitable for LCP.