Multilingual Detection of Personal Employment Status on Twitter

Detecting disclosures of individuals’ employment status on social media can provide valuable information to match job seekers with suitable vacancies, offer social protection, or measure labor market flows. However, identifying such personal disclosures is a challenging task due to their rarity in a sea of social media content and the variety of linguistic forms used to describe them. Here, we examine three Active Learning (AL) strategies in real-world settings of extreme class imbalance, and identify five types of disclosures about individuals’ employment status (e.g. job loss) in three languages using BERT-based classification models. Our findings show that, even under extreme imbalance settings, a small number of AL iterations is sufficient to obtain large and significant gains in precision, recall, and diversity of results compared to a supervised baseline with the same number of labels. We also find that no AL strategy consistently outperforms the rest. Qualitative analysis suggests that AL helps focus the attention mechanism of BERT on core terms and adjust the boundaries of semantic expansion, highlighting the importance of interpretable models to provide greater control and visibility into this dynamic learning process.


Introduction
Up-to-date information on individuals' employment status is of tremendous value for a wide range of economic decisions, from firms filling job vacancies to governments designing social protection systems. At the aggregate level, estimates of labor market conditions are traditionally based on nationally representative surveys that are costly to produce, especially in low- and middle-income countries (Devarajan, 2013; Jerven, 2013). As social media becomes more ubiquitous all over the world, more individuals can now share their employment status with peers and unlock the social capital of their networks. This, in turn, can provide a new lens to examine the labor market and devise policy, especially in countries where traditional measures are lagging or unreliable.
A key challenge in using social media to identify personal disclosures of employment status is that such statements are extremely rare, roughly one in every 10,000 posts, which renders random sampling ineffective and prohibitively costly for the development of a large labeled dataset. On the other hand, simple keyword-based approaches run the risk of producing seemingly high-accuracy classifiers while substantially missing the linguistic variety used to describe events such as losing a job, looking for a job, or starting a new position (see Figure 1 for examples). In the absence of a high-quality, comprehensive, and diverse ground truth about personal employment disclosures, it is difficult to develop classification models that accurately capture the flows in and out of the labor market in any country, let alone estimate them robustly across multiple countries. Furthermore, state-of-the-art deep neural models provide little visibility into or control over the linguistic patterns captured by the model, which hampers the ability of researchers and practitioners to determine whether the model has truly learned new linguistic forms and sufficiently converged.
Active Learning (AL) is designed for settings with an abundance of unlabeled examples and limited labeling resources (Cohn et al., 1994). It aims to focus the learning process on the most informative samples and maximize model performance for a given labeling budget. In recent years, AL has proved successful in several settings, including policy-relevant tasks involving social media data (Pohl et al., 2018; Palakodety et al., 2020).
The success of pre-trained language models such as BERT (Devlin et al., 2019) in a variety of language understanding tasks has sparked interest in using AL with these models for imbalanced text classification. Yet, most research in this field has focused on artificially generated rarity in data or on imbalance that is not as extreme as the present setting (Ein-Dor et al., 2020; Schröder et al., 2021). Therefore, there is no evidence of the efficiency of AL with BERT-based models for sequence classification in real-world settings with extreme imbalance. It is unclear whether some AL strategies perform significantly better than others in these settings, how quickly the different strategies reach convergence (if at all), and how they explore the linguistic space.
In this work, we leverage BERT-based models (Devlin et al., 2019) in three different AL paradigms to identify tweets that disclose an individual's employment status or a change thereof. We train classifiers in English, Spanish, and Portuguese to determine whether the author of a tweet recently lost her job, was recently hired, is currently unemployed, is posting to find a job, or is posting a job offer. We use two standard AL strategies, Uncertainty Sampling (Lewis and Gale, 1994) and Adaptive Retrieval (Mussmann et al., 2020), and propose a novel strategy we name Exploit-Explore Retrieval that uses k-skip-n-grams (n-grams with k skipped tokens) to explore the space and provide improved interpretability. We evaluate the models both quantitatively and qualitatively across languages and AL strategies, and compare them to a supervised learning baseline with the same number of labels. Our contributions are:
• An evaluation of three AL strategies for BERT-based binary classification under extreme class imbalance using real-world data.
• A novel AL strategy for sequence classification that performs on par with other strategies, but provides additional interpretability and control over the learning process.
• A qualitative analysis of the linguistic patterns captured by BERT across AL strategies.
• A large labeled dataset of tweets about unemployment and fine-tuned models in three languages to stimulate research in this area.
2 Background and related work
A key challenge in identifying self-disclosures on social media is the rare and varied nature of such content given a limited labeling budget. Prior work that studied self-disclosures on Twitter has either used pattern matching, which is prone to large classification errors (Antenucci et al., 2014; Proserpio et al., 2016), or focused on curated datasets (Li et al., 2014; Preoţiuc-Pietro et al., 2015; Sarker et al., 2018; Ghosh Chowdhury et al., 2019), which provide no guarantees about recall or coverage of the positive class. These issues are more severe in real-world settings of extreme imbalance, where random sampling is unlikely to retrieve any positives, let alone diverse ones. These challenges motivate the use of AL, as described next.

Active Learning
AL has been used successfully in various settings to maximize classification performance for a given labeling budget (see Settles (1995) for a survey).
With the emergence of pre-trained language models such as BERT (Devlin et al., 2019) and their success across a number of different language tasks, recent work has studied the combination of AL and BERT, either by using BERT to enhance traditional AL methods (Yuan et al., 2020) or by applying established AL methods to improve BERT's classification performance (Zhang and Zhang, 2019; Shelmanov et al., 2019; Liu et al., 2020; Grießhaber et al., 2020; Prabhu et al., 2021; Schröder et al., 2021).
In the specific case of binary classification with moderate class imbalance, Ein-Dor et al. (2020) show that AL with BERT significantly outperforms random sampling but that no single AL strategy stands out in terms of BERT-based classification performance, in both balanced and imbalanced settings. Yet, the authors only consider a relatively moderate class imbalance of 10-15% positives and do not cover extreme imbalance, which is common in many text classification tasks. Our research examines a considerably more extreme imbalance of about 0.01% positives, where traditional AL approaches can be ineffective (Attenberg and Provost, 2010). Under this extreme imbalance, Mussmann et al. (2020) show the potential of AL with BERT to outperform random sampling for pairwise classification. To the best of our knowledge, this work is the first to compare the performance of AL methods for BERT-based sequence classification in real-world extreme imbalance settings.
3 Experimental procedure

Data collection
Our dataset was collected from the Twitter API. It contains the timelines of users with at least one tweet in the Twitter Decahose and with an inferred profile location in the United States, Brazil, or Mexico. In addition to the United States, we chose to focus on Brazil and Mexico as both are middle-income countries where Twitter's penetration rate is relatively high. For each country, we drew a random sample of 200 million tweets covering the period between January 2007 and December 2020, excluding retweets. We then split it evenly into two mutually exclusive random samples R_e and R_s. In the following sections, we use R_e to evaluate each model's performance in a real-world setting and R_s to sample new tweets to label.
Our labeling process sought to identify four non-exclusive, binary states that workers may experience during their careers: losing a job ("Lost Job"), being unemployed ("Is Unemployed"), searching for a job ("Job Search"), and finding a job ("Is Hired"). We only considered first-person disclosures as positives. For the classes "Lost Job" and "Is Hired", we only considered events that happened in the past month as positives, as we want to determine the user's current employment status. To complement the focus on workers, we also labeled tweets containing job offers ("Job Offer"). We used Amazon Mechanical Turk (MTurk) to label tweets according to these five classes (see Figure 1 and Section A.2 for details).

Initialization sample
As previously stated, the extreme imbalance of our classification task, one positive example for every 10,000 tweets, renders random sampling ineffective and prohibitively costly. In order to build high-performing classifiers at a reasonable cost, we selected a set of 4 to 7 seed keywords per class and country that are both highly specific to the positives and sufficiently frequent. To do so, we defined a list of candidate seeds, drawing from Antenucci et al. (2014) for the US and asking native speakers in the case of Mexico and Brazil, and individually evaluated their specificity and frequency (see Section A.1 for additional details). We then randomly sampled 150 tweets containing each seed from R_s, allowing us to produce a stratified sample L_0 of 4,524 English tweets, 2,703 Portuguese tweets, and 3,729 Spanish tweets (Alg. 1). We then labeled each tweet using Amazon Mechanical Turk (MTurk), allowing us to construct a language-specific stratified sample that is common to the five classes (see Section A.3 for descriptive statistics of the stratified sample).
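The seed-based construction of the initialization sample can be sketched as follows. This is a simplified, in-memory view: function and variable names are ours, and the actual pipeline operates on the full 100-million-tweet sample R_s rather than a Python list.

```python
import random

def stratified_seed_sample(tweets, seeds, per_seed=150, seed=0):
    """Build a stratified initialization sample: for each seed keyword,
    draw up to `per_seed` random tweets that contain it (case-insensitive),
    then deduplicate, since a tweet may contain several seeds."""
    rng = random.Random(seed)
    sample = []
    for s in seeds:
        matches = [t for t in tweets if s.lower() in t.lower()]
        sample.extend(rng.sample(matches, min(per_seed, len(matches))))
    seen, deduped = set(), []
    for t in sample:
        if t not in seen:
            seen.add(t)
            deduped.append(t)
    return deduped
```

The resulting sample is then sent to MTurk for labeling on all five classes at once, so a single stratified sample serves every classifier for a given language.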

Models
We trained five binary classifiers to predict each of the five aforementioned labeled classes. Preliminary analysis found that BERT-based models considerably and consistently outperformed keyword-based models, static embedding models, and combinations of these models. We benchmarked several BERT-based models and found that the following gave the best performance on our task: Conversational BERT for English tweets (Burtsev et al., 2018), BERTimbau for Brazilian Portuguese tweets (Souza et al., 2020), and BETO for Mexican Spanish tweets (Cañete et al., 2020) (see Section A.4 for details on model selection).
We fine-tuned each BERT-based model on a 70:30 train-test split of the labeled tweets for 20 epochs (Alg. 1). Following Dodge et al. (2020), we repeated this process for 15 different random seeds and retained the best performing model in terms of area under the ROC curve (AUROC) on the test set at or after the first epoch (see Section A.5 for details).
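The seed-restart selection can be sketched as follows, assuming a hypothetical `train_fn` that fine-tunes one model per random seed and returns an object exposing scores; the actual training loop, checkpointing, and per-epoch selection are more involved.

```python
from sklearn.metrics import roc_auc_score

def train_with_restarts(train_fn, X_test, y_test, n_seeds=15):
    """Fine-tune once per random seed and keep the model with the best
    test-set AUROC (Dodge et al., 2020). `train_fn(seed)` is assumed to
    return an object with a `predict_proba(X) -> scores` method."""
    best_model, best_auroc = None, -1.0
    for seed in range(n_seeds):
        model = train_fn(seed)
        auroc = roc_auc_score(y_test, model.predict_proba(X_test))
        if auroc > best_auroc:
            best_model, best_auroc = model, auroc
    return best_model, best_auroc
```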

Model evaluation
While the standard classification performance measure in an imbalanced setting is the F1 score with a fixed classification threshold (e.g. 0.5), it is not applicable in our case for two reasons. First, we care about the performance on a large random set of tweets, and the only labeled set we could compute the F1 metric from is the stratified test set, which is not representative of the extremely imbalanced random sample R_e. Second, the fact that neural networks are poorly calibrated (Guo et al., 2017) makes the choice of a predefined classification threshold somewhat arbitrary and most likely sub-optimal.
We developed an alternative evaluation strategy. First, we computed the predicted score of each tweet in the random sample R_e (Alg. 1). Then, for each class, we labeled 200 tweets in R_e along the score distribution (see Section A.7.1 for more details). We measured the performance of each classifier on R_e by computing:
• the average precision, as is common in information retrieval;
• the number of predicted positives, defined as the average rank in the confidence score distribution at which the share of positives reaches 0.5;
• the diversity, defined as the average pairwise distance between true positives.
Details about the evaluation metrics can be found in Section A.7.
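The first and third metrics can be illustrated with a short sketch. The inputs here are hypothetical (labels and scores for the 200 sampled tweets, plus embeddings of the true positives); the paper's exact estimators, including the rank-based count of predicted positives and the distance measure used, are detailed in Section A.7.

```python
from sklearn.metrics import average_precision_score
from scipy.spatial.distance import pdist

def evaluate_classifier(y_true, scores, tp_embeddings):
    """Compute two of the metrics above on a labeled evaluation sample:
    - average precision over the labeled tweets
    - diversity: average pairwise cosine distance between embeddings
      of the true positives"""
    ap = average_precision_score(y_true, scores)
    diversity = float(pdist(tp_embeddings, metric="cosine").mean())
    return ap, diversity
```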
Algorithm 1: Experimental procedure. Evaluation step: sample tweets along the score distribution in R_e; have them labeled; compute the average precision, number of predicted positives, and diversity metrics.

Active Learning strategies
Next, we used pool-based AL (Settles, 1995) in batch mode, with each class-specific fine-tuned model as the classification model, in order to query new informative tweets in R_s. We compared three different AL strategies, aiming to balance the goal of improving the precision of a classifier with expanding the number and the diversity of detected positive instances:
• Uncertainty Sampling consists of sampling the instances that a model is most uncertain about. In a binary classification problem, the standard approach is to select examples with a predicted score close to 0.5 (Settles, 2009). In practice, this rule of thumb might not always identify uncertain samples when imbalance is high (Mussmann et al., 2020), especially with neural network models known to be poorly calibrated (Guo et al., 2017). To overcome this issue, we contrast a naive approach, which queries the 100 instances whose uncalibrated scores are closest to 0.5, with an approach that uses calibrated scores (see Section A.9 for details).
• Adaptive Retrieval aims to maximize the precision of a model by querying instances for which the model is most confident of their positivity (Mussmann et al., 2020). This approach is related to certainty sampling (Attenberg et al., 2010). Here, we select the 100 tweets with the highest predicted score for each class.
Additionally, we compared these AL strategies to a supervised Stratified Sampling baseline that uses the same initial motifs defined in Section 3.2 and the same number of labels as available to the AL strategies. Overall, for each strategy, each iteration, and each class, we labeled 100 new tweets in R_s. We then combined the 500 new labels across classes with the existing ones to fine-tune and evaluate a new BERT-based model for each class as described in Section 3.3, which we then used to select tweets for labeling in the next iteration. We considered that an AL strategy had converged when there was no significant variation in average precision, number of predicted positives, and diversity for at least two iterations (see Section A.7.6 for details).
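The query-selection step of the two standard strategies can be sketched as follows. This is a simplified view: the calibrated uncertainty variant would first recalibrate the scores (Section A.9), and the pool scores would come from scoring R_s with the current fine-tuned model.

```python
import numpy as np

def select_queries(scores, strategy, batch_size=100):
    """Select pool indices to send for labeling.
    - "uncertainty": scores closest to 0.5
    - "adaptive": highest scores (most confidently positive)"""
    scores = np.asarray(scores)
    if strategy == "uncertainty":
        order = np.argsort(np.abs(scores - 0.5))
    elif strategy == "adaptive":
        order = np.argsort(-scores)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order[:batch_size]
```

After labeling, the selected tweets are merged into the training set and the model is re-fine-tuned before the next iteration.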

Initial sample
At iteration 0, we fine-tuned a BERT-based classifier on a 70:30 train-test split of the initialization sample L_0 for each class and country. All the AUROC values on the test set are reported in Table 7.
We obtain very high AUROCs, ranging from 0.944 to 0.993 across classes and countries. "Job Offer" has the highest AUROCs, with values ranging from 0.985 for English to 0.991 for Portuguese and 0.993 for Spanish. Upon closer examination of positives for this class, we find that the linguistic structure of tweets mentioning job offers is highly repetitive, with a large share of these tweets containing sentences such as "We're #hiring! Click to apply:" or naming job listing platforms (e.g. "#CareerArc"). By contrast, the most difficult class to predict is "Lost Job", with an AUROC on the test set of 0.959 for English and 0.944 for Spanish. This class also has the highest imbalance, with approximately 6% positives in the stratified sample for these two languages.
Taken together, these results show that a fine-tuned BERT model can achieve very high classification performance on a stratified sample of tweets across classes and languages. However, these numbers cannot be extrapolated to directly infer the models' performance on random tweets, which we discuss in the next section.

Active Learning across languages
Next, we compared the performance of our exploit-explore retrieval strategy on English, Spanish, and Portuguese tweets. We used exploit-explore retrieval as it provides results similar to the other strategies (Section 4.3) while allowing greater visibility into the selected motifs during the development process (Section 4.4). We ran 8 AL iterations for each language and report the results in Fig. 2, Fig. 5, and Table 10.
First, we observe substantial improvements in average precision (AP) across countries and classes with just one or two iterations. These improvements are especially salient in cases where precision at iteration 0 is very low. For instance, for the English "Is Unemployed" class and the Spanish "Is Hired" class, average precision goes from 0.14 and 0.07 to 0.83 and 0.80, respectively, from iteration 0 to iteration 1 (Fig. 2 and Fig. 5). A notable exception to this trend is the class "Job Offer", especially for English and Portuguese. These performance differences can in part be explained by the varying quality of the initial seed list across classes, as confirmed by the stratified sampling baseline performance discussed in Section 4.3. In the case of "Job Offer", an additional explanation, discussed earlier in Section 4.1, is the repetitive structure of job offers in tweets, which makes this class easier to detect than others.
Also, the class "Lost Job" has the worst performance in terms of AP across countries. One reason is that the data imbalance for this class is even higher than for the other classes, as mentioned in Section 4.1. Another explanation for the low precision is the ambiguity inherent to the recency constraint, namely that an individual must have lost her job at most one month prior to posting the tweet.
Apart from the "Job Offer" class in English and Portuguese, AL consistently allows us to expand quickly from iteration 0 levels, with the number of predicted positives multiplied by a factor of up to 10^4 (Fig. 2). Combined with high AP values, this result means that the classifiers capture substantially more positives than at iteration 0. This high expansion is accompanied by increasing semantic diversity among true positive instances.
The class "Job Offer" stands out with little change in expansion and diversity in the English and Portuguese cases. For Spanish, expansion and diversity changes are higher. One explanation is that the structure of Mexican job offers is less repetitive, with individual companies frequently posting job offers, as opposed to job aggregators in the case of the US and Brazil.
Overall, apart from a few edge cases, we find that AL used with pre-trained language models is successful at significantly improving precision while expanding the number and the diversity of predicted positive instances in a small number of iterations across languages. Indeed, precision gains reach up to 90 percentage points from iteration 0 to the last iteration across languages and classes, and the number of predicted positives is multiplied by a factor of up to 10^4. Furthermore, on average, the model converges in only 5.6 iterations across classes for English and Portuguese, and in 4.4 iterations for Spanish (see Table 10 for details).

Figure 2: Results for English (green), Portuguese (orange), and Spanish (purple). We report the standard error of the average precision and diversity estimates, and a lower and an upper bound for the number of predicted positives. Additional details on how the evaluation metrics are computed are reported in Section A.7.

Comparing Active Learning strategies
In this section, we evaluate on English tweets the stratified sampling baseline and the four AL strategies described in Section 3.5, namely exploit-explore retrieval, adaptive retrieval, and uncertainty sampling with and without calibration. We ran five iterations for each strategy and report the results in Figure 3 in this section as well as in Table 11 and Figure 6 in Section A.10.
We find that AL brings an order of magnitude more positives while preserving or improving both the precision and the diversity of results. Apart from the "Job Offer" class discussed in Section 4.2, AL consistently outperforms the stratified sampling baseline. This is especially true for the classes "Is Unemployed" and "Lost Job", where the baseline performance stagnates at a low level, suggesting a poor seed choice, but also holds for the classes "Is Hired" and "Job Search", which have stronger baseline performance. We also find that no AL strategy consistently dominates the rest in terms of precision, number, and diversity of positives. The gains in performance are similar across AL strategies and are particularly high for the classes "Lost Job" and "Is Unemployed", which start with a low precision. The number of predicted positives and the diversity measures also follow similar trends across classes and iterations.
We also observe occasional "drops" in average precision of more than 25% from one iteration to the next. Uncalibrated uncertainty sampling seems particularly susceptible to these drops, with at least one occurrence for each class. Upon examination of the tweets sampled for labeling by this strategy, the vast majority are negatives, and when a few positives emerge, their number is not large enough to allow the model to generalize well. This variability slows down the convergence of uncertainty sampling when scores are not calibrated (Table 11). In contrast, calibrated uncertainty sampling is less susceptible to these swings, emphasizing the importance of calibration for more "stable" convergence in settings of extreme imbalance.

Figure 3: Results across AL strategies. We report the standard error of the average precision and diversity estimates, and a lower and an upper bound for the number of predicted positives. Additional details on how the evaluation metrics are computed are reported in Section A.7.
Taken together, our quantitative results show that the positive impact of AL on classification performance in an extremely imbalanced setting holds across AL strategies. Aside from a few occasional performance "drops", we find significant gains in precision, expansion, and diversity across strategies. Yet, we find that no AL strategy consistently dominates the others across a range of prediction tasks for which the number and the linguistic complexity of positive instances vary widely. Next, we investigate the results qualitatively to gain a deeper understanding of the learning process.

Qualitative analysis
We qualitatively examined the tweets selected for labeling by each strategy to better understand what BERT-based models capture and to reflect on the quantitative results. We focused on English tweets only and took a subsample of tweets at each iteration to better understand each strategy's performance. We excluded the "Job Offer" class from this analysis since its performance is exceptionally high, even at iteration 0.
Our analysis finds that many tweets queried by the various AL strategies capture a general "tone" that is present in tweets about unemployment but is not specific to one's employment status. Examples include tweets of the form "I'm excited to ... in two days" for the recently hired class, "I've been in a shitty mood for ..." for unemployment, or "I lost my ..." for job loss. This type of false positive wanes as the AL iterations progress, which suggests that a key to the success of AL is first to fine-tune the attention mechanism to focus on the core terms rather than the accompanying text that is not specific to employment status. In the stratified sampling case, the focus on this unemployment "tone" remains uncorrected, explaining the poor performance for the classes "Lost Job" and "Is Unemployed" and the performance drops for "Is Hired" and "Job Search".
A second theme in tweets queried by AL involves the refinement of the initial motifs. Uncertainty sampling (calibrated and uncalibrated), adaptive retrieval, and the exploitation part of our exploit-explore retrieval method tend to query tweets that either directly contain a seed motif or a close variant thereof. For example, tweets for the class "Lost Job" may contain the seed motifs "laid off", "lost my job", and "just got fired". As mentioned in Section 4.2 to explain occasional drops in performance, many tweets labeled as negatives over-generalize the semantic concept, for instance by expanding to other types of losses (e.g. "lost my phone") or other types of actions (e.g. "got pissed off"), or by simply missing the dependence on first-person pronouns (e.g. "@user got fired"). Many of the positively labeled tweets contain more subtle linguistic variants that do not change the core concept, such as "I really need a job", "I really need to get a job", "I need to find a job", or "I need a freaken job". Adaptive retrieval chooses these subtle variants more heavily than other strategies, with some iterations mostly populated with "I need a job" variants. Overall, these patterns are consistent with a view of the learning process, and specifically of the classification layer of the BERT model, as seeking the appropriate boundaries of the target concept.
Finally, the exploration part of exploit-explore retrieval makes the search for new forms of expression about unemployment more explicit and interpretable. For example, the patterns explored in the first few iterations of exploit-explore retrieval include "I ... lost ... today", "quit ... my ... job", "I ... start my ... today", and "I'm ... in ... need". A detailed presentation of the explored k-skip-n-grams for US tweets can be found in Table 9 of Section A.8. While this strategy suffers from issues that also affect other AL strategies, we find that its explore part is more capable of finding new terms that were not part of the seed list (e.g. quit, career) and provides the researcher with greater insight into and control over the AL process.

Discussion and conclusion
This work developed and evaluated BERT-based models in three languages and used three different AL strategies to identify tweets related to an individual's employment status. Our results show that AL achieves large and significant improvements in precision, expansion, and diversity over stratified sampling with only a few iterations and across languages. In most cases, AL brings an order of magnitude more positives while preserving or improving both the precision and diversity of results. Despite using fundamentally different AL strategies, we observe that no strategy consistently outperforms the rest. Within the extreme imbalance setting, this is in line with, and complements, the findings of Ein-Dor et al. (2020).
Additionally, our qualitative analysis and exploration of exploit-explore retrieval give further insight into the performance improvements provided by AL, finding that a substantial share of queried tweets hones the model's focus on employment rather than surrounding context and expands the variety of motifs identified as positive. This makes exploit-explore retrieval a valuable tool for researchers to obtain greater visibility into the AL process in extreme imbalance cases without compromising on performance.
While the present work demonstrates the potential of AL for BERT-based models under extreme imbalance, an important direction for future work would be to further optimize the AL process. One could, for instance, study the impact on performance of the stratified sample size or the AL batch size. To overcome the poor seed quality for some classes, other seed generation approaches could be tested, such as mining online unemployment forums using topic modeling techniques to discover different ways of talking about unemployment. In terms of model training and inference, the use of multi-task learning for further performance improvement could be studied, given that the classes of unemployment are not mutually exclusive. We hope that our experimental results as well as the resources we make available will help bridge these gaps in the literature.

A.1 Seed selection

To select initial seed motifs, we used the list of initial motifs elaborated by Antenucci et al. (2014). We also imposed extra requirements on additional motifs, such as the presence of first-person pronouns (e.g. "I got fired" for the "Lost Job" class), as we restricted the analysis to the author's own labor market situation. We also used adverbs such as "just" to take into account the temporal constraint for the classes "Lost Job" and "Is Hired". For Mexican Spanish and Brazilian Portuguese motifs, we both translated the English motifs and asked native speakers to confirm the relevance of the translations and add new seeds (e.g. "chamba" is Mexican Spanish slang for "work"). We then ran a similar selection process.
For each candidate seed motif, we computed specificity and frequency on the random set R_e. For each class χ, we defined the specificity of a given motif M as the share of positives for class χ in a random sample of 20 tweets from R_e that contain M. The frequency of motif M is defined as the share of tweets in R_e that contain M.
In order to retain motifs that are both frequent and specific enough, we defined the following selection rule: we only retained motifs with a specificity at or above 1% and for which the product of specificity and frequency is above 1 × 10^-7.
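The selection rule can be expressed directly. The motif names and the input format (a mapping from motif to estimated specificity and frequency, both as fractions) are illustrative.

```python
def retain_seeds(candidates):
    """Keep motifs with specificity >= 1% and whose product of
    specificity and frequency exceeds 1e-7."""
    return [motif for motif, (spec, freq) in candidates.items()
            if spec >= 0.01 and spec * freq > 1e-7]
```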
In total, we evaluated 54 seeds for the US, 101 for Mexico, and 42 for Brazil. After evaluation, we retained 26 seeds for the US, 26 for Mexico, and 21 for Brazil. We report the retained motifs in Table 1.

A.2 Data labeling
To label unemployment-related tweets, we used the crowdsourcing platform Amazon Mechanical Turk. This platform has the advantage of an international workforce speaking several languages, including Spanish and Brazilian Portuguese in addition to English.
For each tweet to label, turkers were asked the five questions listed in Table 2. Each turker was presented with a list of 50 tweets, and each labeled tweet was evaluated by at least two turkers. A turker could answer yes, no, or I am not sure. We included two attention check questions to exclude low-quality answers. For the attention checks, we had the following two sentences labeled: "I lost my job today", which is a positive for the classes "Lost Job" and "Is Unemployed" and a negative for the other classes, and "I got hired today", which is a positive for the class "Is Hired" and a negative for the other classes. We discarded the answers of workers who did not give the five correct labels for each quality check. To create a label for a given tweet, we required that at least two workers provide the same answer. A yes was then converted to a positive label and a no to a negative label; a tweet labeled by two workers as unsure was dropped from the sample.
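The aggregation rule can be sketched as follows. This is a simplified reading of the procedure above: how disagreements beyond these cases were resolved (e.g. requesting additional labels) is our assumption, and here such tweets are simply left unlabeled.

```python
def aggregate_labels(answers):
    """Aggregate MTurk answers ("yes"/"no"/"unsure") for one tweet.
    Returns 1 (positive) or 0 (negative) when at least two workers
    agree on yes or no; returns None when workers agree on "unsure"
    or no answer reaches two votes (tweet dropped)."""
    for value, label in (("yes", 1), ("no", 0)):
        if answers.count(value) >= 2:
            return label
    return None
```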
During this labeling process, all workers were paid an hourly income above the minimum wage in their respective countries. For a labeling task of approximately 15 minutes, turkers from the US, Mexico, and Brazil received 5 USD, 5 USD, and 3 USD, respectively.

A.3.1 Share of positives per class
We provide descriptive statistics on the share of positives per class in the stratified sample for each language in Table 3.

A.3.2 Class co-occurrence
In this section, we analyze the extent to which the classes are mutually exclusive. For this, we focus on the initial English stratified sample.
First, the classes "Is Unemployed", "Lost Job" and "Job Search" are not mutually exclusive in many cases. As expected, the class "Lost Job" is highly correlated with "Is Unemployed": 95% of "Lost Job" positives are also positives for "Is Unemployed" in the US initial stratified sample (e.g. "i lost my job on monday so i'm hoping something would help.", "as of today, for the first time in two years.....i am officially unemployed"). There are a few exceptions where users get hired quickly after being fired (e.g. "tfw you find a new job 11 days after getting laid off"). "Job Search" is also correlated with "Is Unemployed" (e.g. "I need a job, anyone hiring?"), though less so than "Lost Job", with 43% of its positives also being positives for "Is Unemployed" in the initial stratified sample. Cases where users are looking for a job but are not unemployed include looking for a second job (e.g. "need a second job asap.") or looking for a better job while working (e.g. "tryna find a better job"). There are also a few ambiguous cases where users mention that they are looking for a job but it is unclear whether they are unemployed (e.g. "job hunting"), as well as edge cases where users just got hired but are already looking for another job (e.g. "i got hired at [company] but i don't like the environment any other suggestions for jobs ?").
For the class "Is Unemployed", mutually exclusive examples are cases where the user only mentions her unemployment, without mentioning a recent job loss or a job search (e.g. "well i'm jobless so there's that"). Second, the classes "Is Hired" and "Job Offer" are essentially orthogonal to one another and to the other classes. The class "Is Hired" (e.g. "good morning all. started my new job yesterday. everyone was awesome.") is almost always uncorrelated with the other classes, apart from the few edge cases mentioned above. The class "Job Offer" (e.g. "we are #hiring process control/automation engineer job in atlanta, ga in atlanta, ga #jobs #atlanta") is almost always orthogonal to the other classes, with a few exceptions. For instance, a user who just got hired may mention job offers in her new company (e.g. "if you guys haven't been to a place called top golf i suggest you to go there or apply they are literally the best people ever i'm so happy i got hired").
We detail the class co-occurrence in the US initial stratified sample in Table 4.

A.3.3 Additional descriptive statistics
In this section, we include additional information about the US initial stratified sample. Table 5 contains the average character length and most frequent tokens per class. Table 6 describes the part-of-speech tag distribution in positives across classes.

A.4 Pre-trained language model characteristics
Is Unemployed: Does the tweet indicate that the person who wrote the tweet is currently (at the time of tweeting) unemployed? For example, tweeting "Now I am unemployed", or "I just quit my job" is likely to indicate that the person who tweeted is currently unemployed.
Lost Job: Does this tweet indicate that the person who wrote the tweet became unemployed within the last month? For example, tweeting "I lost my job today", or "I was fired earlier this week" is likely to indicate that the person who tweeted became unemployed within the last month.
Job Search: Does this tweet indicate that the person who wrote the tweet is currently searching for a job? For example, tweeting "I am looking for a job", or "I am searching for a new position" is likely to indicate that the person who tweeted is currently searching for a job.
Is Hired: Does this tweet indicate that the person who wrote the tweet was hired within the last month? For example, tweeting "I just found a job", or "I got hired today" is likely to indicate that the person who tweeted was hired within the last month.
Job Offer: Does this tweet contain a job offer? For example, tweeting "Looking for a new position?", or "Here is a job opportunity you might be interested in" is likely to indicate that the tweet contains a job offer.

Table 2: List of questions asked to the Amazon Turkers when labelling each tweet

To classify tweets in different languages, as mentioned in Section 3.3, we used the following pre-trained language models from the Hugging Face model hub (Wolf et al., 2020):
- Conversational BERT² for English tweets, trained and released by DeepPavlov (Burtsev et al., 2018). This model was initialized with BERT-base cased weights and shares the same configuration. It was then further pre-trained with a masked language modeling objective on an English corpus containing social media data (Twitter, Reddit), dialogues (Li et al., 2017), debate transcripts (Zhang et al., 2016), movie subtitles (Lison and Tiedemann, 2016) as well as blog posts (Schler et al., 2006).
- BETO for Spanish tweets (Cañete et al., 2020). This model has a BERT-base architecture and was pre-trained from scratch on a Spanish corpus derived from Wikipedia and the Spanish part of the OPUS project (Tiedemann, 2012).
- BERTimbau for Brazilian Portuguese tweets (Souza et al., 2020). This model also has a BERT-base architecture and was pre-trained from scratch on brWaC, a large multi-domain Brazilian Portuguese corpus (Wagner Filho et al., 2018).
All three language models have 110 million parameters.
² Available at https://huggingface.co/DeepPavlov/bert-base-cased-conversational

When it comes to choosing a language model for each language, the emerging literature on pre-training language models on tweets to improve downstream tasks in the Twitter context gave us several potential candidates for English tweet classification. On top of Conversational BERT, we experimented with BERTweet (Nguyen et al., 2020), the leader of the TweetEval leaderboard³ as of March 2022 (Barbieri et al., 2020). We also tested renowned pre-trained language models such as BERT-base and RoBERTa-base. We found that both Conversational BERT and BERTweet outperformed these well-known models on our task. Also, while BERTweet usually slightly outperformed Conversational BERT in terms of AUROC on the test set from the stratified sample, it performed worse on the random set R_e. This is why we chose Conversational BERT for English tweets.
For Spanish and Brazilian Portuguese tweets, in the absence of Twitter-specialized language models, we opted for the best performing pre-trained language models for these languages as of Fall 2020, namely BETO for Spanish and BERTimbau for Brazilian Portuguese. We also experimented with multilingual language models such as XLM-RoBERTa (Conneau et al., 2020), but the monolingual approaches for Spanish and Brazilian Portuguese performed better, both on the test set from the stratified sample and on the random set.

A.5 Fine-tuning and evaluation
As mentioned in Section 3.3 and following Dodge et al. (2020), we fine-tuned each BERT-based model with 15 different seeds for 20 epochs. We evaluated the models 10 times per epoch and used early stopping with a patience of 11. We used a training and evaluation batch size of 8. The best model is defined as the model with the highest area under the ROC curve (AUROC) on the evaluation set, at or after the first epoch.
As described in Algorithm 1, we then ran inference with the best model on both random sets R_e and R_s. To speed up inference, we converted the PyTorch models to ONNX.
In terms of computing infrastructure, we used V100 (32GB) or RTX8000 (48GB) GPUs for fine-tuning and parallelized inference over 2,000 CPU nodes. The average runtime is 45 minutes for fine-tuning and evaluation, and 3 hours for inference.

A.6 Performance at iteration 0
We report detailed AUROC results on the test set from the stratified sample in Table 7.

A.7 Evaluation metrics
In this section, we detail the evaluation process. The values of each metric across iterations for each language and each method can be found in Tables 10 and 11, respectively.

A.7.2 Average Precision
With the retained tweets, we computed the Average Precision (AP) at each iteration for each class and language. We used the standard definition of AP in information retrieval and defined AP at iteration i for class c and method m as:

$$AP_{i,c,m} = \frac{1}{N_{i,c,m}} \sum_{r \in R_{i,c,m}} P(r) \cdot pos(r)$$

where:
• R_{i,c,m} is the set of ranks in the confidence score distribution of class c at iteration i and for method m of all tweets sampled for evaluation and labeled for class c and method m at iteration i and preceding iterations;
• P(r) is the share of positives among sampled tweets with rank, at iteration i and for class c, inferior or equal to r;
• pos(r) is equal to 1 if the tweet ranked r for iteration i and class c is positive and 0 otherwise;
• N_{i,c,m} is the number of tweets sampled and labeled for class c and method m at iteration i and preceding iterations.
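Under the definitions above, AP can be computed from the ranked labels in a few lines. The sketch below uses illustrative data and, following the bullet definitions, normalizes by the number of labeled tweets rather than by the number of positives:

```python
def average_precision(ranked_labels):
    """AP over sampled-and-labeled tweets.

    ranked_labels: list of (rank, is_positive) pairs, where rank is the
    tweet's rank in the model's confidence-score distribution
    (lower rank = higher confidence).
    """
    ranked_labels = sorted(ranked_labels)          # order by rank
    ap, positives_so_far = 0.0, 0
    for seen, (_rank, pos) in enumerate(ranked_labels, start=1):
        positives_so_far += pos
        precision_at_r = positives_so_far / seen   # P(r): positive share at rank <= r
        ap += precision_at_r * pos                 # pos(r) selects positive ranks
    return ap / len(ranked_labels)                 # normalize by N

# 5 labeled tweets ordered by model confidence
labels = [(10, 1), (25, 1), (40, 0), (55, 1), (90, 0)]
print(average_precision(labels))  # 0.55
```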

A.7.3 Number of predicted positives
We defined the number of predicted positives E as the average rank in the confidence score distribution at which the share of positives reaches 0.5. In practice, for each iteration i, class c and the related BERT model M, we first ranked the evaluation set R_e according to the prediction scores from M. We then binned the evaluation labels of each iteration up to i into 20 bins of equal size, and estimated the proportion of positives and the average rank in each bin. We then identified the first bin for which the proportion of positive labels reaches 0.5. We estimated an upper and a lower bound for E by taking the average rank of the tweets in the bins above and below the 0.5 cutoff respectively, and estimated E as the midpoint between its lower and upper bound estimates. For each round, we report E as well as its lower and upper bound estimates. We provide an illustration of this procedure in Figure 4. By convention, the number of predicted positives is equal to 1 when the proportion of positive labels sampled from the evaluation set remains below 0.5 for all ranks.

Figure 4: Illustration of the procedure used to determine the number of predicted positives. In this example, the number of predicted positives is R1 for iteration 1 and R2 for iteration 2.
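A minimal sketch of this binning procedure; the exact handling of the 0.5 cutoff and of the first bin is our reading of the text, and `predicted_positives` is an illustrative name:

```python
import numpy as np

def predicted_positives(ranks, labels, n_bins=20):
    """Estimate E with lower/upper bounds from binned evaluation labels.

    ranks: ranks in the confidence-score distribution; labels: 0/1.
    Returns (E, lower_bound, upper_bound).
    """
    order = np.argsort(ranks)
    ranks = np.asarray(ranks, dtype=float)[order]
    labels = np.asarray(labels, dtype=float)[order]
    avg_rank = [b.mean() for b in np.array_split(ranks, n_bins)]
    pos_share = [b.mean() for b in np.array_split(labels, n_bins)]
    for k, share in enumerate(pos_share):
        if share < 0.5:
            # First bin whose positive share drops below 0.5: E is the midpoint
            # between the average ranks of the bins above and below the cutoff.
            # Falling back to rank 1 when the very first bin is below 0.5 is an
            # assumption, echoing the E = 1 convention in the text.
            lower = avg_rank[k - 1] if k > 0 else 1.0
            upper = avg_rank[k]
            return (lower + upper) / 2, lower, upper
    # Share never drops below 0.5: return the last bin's average rank.
    return avg_rank[-1], avg_rank[-1], avg_rank[-1]

# Toy example: ranks 1..40, first 20 positive, 4 bins of 10
e, lo, hi = predicted_positives(list(range(1, 41)), [1] * 20 + [0] * 20, n_bins=4)
```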

A.7.4 Diversity of true positives
To compute diversity for a given iteration i and class c, we first encoded all positive tweets sampled for the evaluation of class c at iteration i and preceding iterations into sentence embeddings (Reimers and Gurevych, 2019). To do so, we used the "all-mpnet-base-v2" model for English and the "paraphrase-multilingual-mpnet-base-v2" model for Spanish and Portuguese (Reimers and Gurevych, 2020). These models are openly available on the sentence-transformers GitHub repository⁴.
After computing the embeddings, we defined the diversity of a set of positive tweets as the mean pairwise distance between all possible pairs in this set. The pairwise distance between tweets A and B is defined as 1 − sim(E_A, E_B), where sim is the cosine similarity function and E_A and E_B are the sentence embeddings of tweets A and B. By convention, diversity is equal to 0 when there is no more than one positive label.
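Given precomputed embeddings, this diversity measure reduces to a mean over the upper triangle of a cosine-similarity matrix. A sketch with toy 2-dimensional vectors standing in for sentence embeddings:

```python
import numpy as np

def diversity(embeddings):
    """Mean pairwise cosine distance (1 - cosine similarity) over all
    distinct pairs of embeddings; 0 when there are fewer than 2 items."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarity matrix
    iu = np.triu_indices(n, k=1)                      # n*(n-1)/2 distinct pairs
    return float(np.mean(1.0 - sims[iu]))

# identical vectors -> diversity 0; orthogonal vectors -> diversity 1
print(diversity([[1, 0], [1, 0]]))  # 0.0
print(diversity([[1, 0], [0, 1]]))  # 1.0
```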

A.7.5 Standard error computation
For average precision and diversity, we derived standard errors using bootstrap samples of the pool of N tweets used to compute the metric. We sampled N tweets with replacement from this pool and repeated the process 1,000 times. We then computed the metric for each of these samples and finally computed their mean and standard error.
For the number of predicted positives, our method does not allow us to directly use the bootstrap. We therefore computed the upper and lower bounds as described in Section A.7.3.
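The bootstrap procedure can be sketched generically (`bootstrap_se` is a hypothetical helper; `metric` stands for average precision or diversity recomputed on each resampled pool):

```python
import random
import statistics

def bootstrap_se(values, metric, n_boot=1000, seed=0):
    """Bootstrap mean and standard error of `metric` over `values`:
    draw N items with replacement, recompute the metric, repeat n_boot times."""
    rng = random.Random(seed)
    n = len(values)
    stats = [
        metric([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.mean(stats), statistics.stdev(stats)

# e.g. standard error of the mean of a small sample of 0/1 labels
mean, se = bootstrap_se([1, 0, 1, 1, 0, 1, 0, 1], statistics.mean)
```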

A.7.6 Convergence
As stated in Section 3.5, we considered that an AL strategy had converged when there was no significant variation in average precision, number of predicted positives, or diversity for at least two iterations.
To determine whether there is a significant variation in average precision or diversity from one iteration to the next, we performed t-tests. For the number of predicted positives, since we could only estimate upper and lower bounds, we considered that there was no significant variation when the intervals between the lower and upper bounds of two consecutive iterations overlapped.
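Both convergence checks can be sketched as follows, assuming scipy's `ttest_ind` for the t-test (the function names are illustrative):

```python
from scipy import stats

def no_significant_change(prev_samples, curr_samples, alpha=0.05):
    """Two-sample t-test between metric samples (e.g. bootstrap values) at two
    consecutive iterations; True when the change is not significant."""
    _, p_value = stats.ttest_ind(prev_samples, curr_samples)
    return p_value > alpha

def bounds_overlap(prev_lo, prev_hi, curr_lo, curr_hi):
    """Interval-overlap check used for the number of predicted positives."""
    return prev_lo <= curr_hi and curr_lo <= prev_hi
```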
We report in bold the metric values at convergence in Tables 10 and 11.

A.8 Exploit-explore retrieval algorithm

In this section, we detail the functioning of the new AL strategy we coin exploit-explore retrieval in Algorithm 2.
We define the k-skip-n-grams used in this approach as follows: for a given text sequence T, the set of k-skip-n-grams, with k a positive integer and n in {2, 3}, consists of all ordered combinations of n words in T. For instance, for T = "I am very happy", the set of k-skip-2-grams is: {(I, am), (I, very), (I, happy), (am, very), (am, happy), (very, happy)}. The k skipped words do not need to be successive. To extract the k-skip-n-grams contained in tweets, each tweet was tokenized using the ekphrasis package (Baziotis et al., 2017).
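Since the definition allows arbitrary gaps, the k-skip-n-grams of a sequence are simply all ordered n-token combinations, which `itertools.combinations` produces directly (whitespace tokenization stands in for ekphrasis here):

```python
from itertools import combinations

def skip_ngrams(text, n):
    """All ordered combinations of n tokens from the sequence; gaps between
    the chosen tokens need not be contiguous."""
    tokens = text.split()  # the paper tokenizes with ekphrasis; split() is a stand-in
    return list(combinations(tokens, n))

print(skip_ngrams("I am very happy", 2))
# [('I', 'am'), ('I', 'very'), ('I', 'happy'),
#  ('am', 'very'), ('am', 'happy'), ('very', 'happy')]
```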
To decide on the 10⁴ threshold for top tweets, we estimated the base rate for each class and country. We defined the base rate for a given class as the share of positives for this class in the whole sample of tweets. To estimate it for each class and country, we computed the specificity and frequency of each initial motif (listed in Table 1) and defined the base rate estimate as the sum over motifs of each motif's frequency weighted by its specificity. We detail the estimation results in Table 8.
The base ranks in our random sample of 100 million tweets R_e (i.e. the base rate multiplied by 10⁸) ranged from 10² to 10⁵, with a majority below 10⁴ in Mexico and Brazil. We tried T = 10³, T = 10⁴ and T = 10⁵ as candidate thresholds for the top tweets, and they gave very similar results for the k-skip-n-grams used in the exploration step. We finally chose 10⁴ to balance between the higher base ranks in the US and the lower base ranks elsewhere. Our choices for the other hyperparameters were dictated by our budget constraint.
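The base rate estimate described above is a specificity-weighted sum of motif frequencies; a small sketch with made-up numbers (not the paper's):

```python
def base_rate_estimate(motifs):
    """Base-rate estimate for a class: sum over motifs of the motif's
    frequency weighted by its specificity."""
    return sum(m["frequency"] * m["specificity"] for m in motifs)

# Illustrative motifs for one class
motifs = [
    {"motif": "i lost my job", "frequency": 2e-6, "specificity": 0.85},
    {"motif": "unemployed", "frequency": 5e-6, "specificity": 0.40},
]
rate = base_rate_estimate(motifs)
base_rank = rate * 1e8  # base rank in a 100-million-tweet sample
```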
To illustrate the exploration part of this method, we detail the top-lift k-skip-n-grams selected from US tweets, for each iteration and each class, in Table 9.

A.9 Calibration for uncertainty sampling
In order to calibrate the BERT confidence scores for uncertainty sampling, we proceeded in the following way.
For each country, AL strategy and class, we used the 200 tweets retained along the confidence score distribution on R_s and labeled for evaluation.
From this labeled set, we built 10,000 balanced bootstrap samples and fit a logistic regression to each of them. We thereby obtained a set of 10,000 logistic regression parameter pairs $((\beta_{0,i}, \beta_{1,i}))_{i \in [1, 10{,}000]}$. We then used this set of parameters to find the BERT confidence score x* for which the calibrated score equals 0.5. To do so, we used Brent's method (Brent, 1971) and defined x* as the root of the following function:

$$f(x) = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \sigma(\beta_{0,i} + \beta_{1,i}\, x) - 0.5$$

where σ is the standard logistic function. Knowing x*, we were then able to perform uncertainty sampling by sampling tweets with confidence scores around x*.
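A sketch of this calibration step, assuming the root function averages the calibrated scores over the bootstrap fits (the β values below are made up for illustration):

```python
import numpy as np
from scipy.optimize import brentq

def calibration_cutoff(betas, lo=0.0, hi=1.0):
    """Root-find the raw confidence score x* whose mean calibrated
    probability over the bootstrap fits equals 0.5.

    betas: list of (b0, b1) logistic-regression parameter pairs.
    """
    b0 = np.array([b[0] for b in betas])
    b1 = np.array([b[1] for b in betas])

    def f(x):
        # mean sigmoid(b0 + b1 * x) over all bootstrap fits, minus 0.5
        return np.mean(1.0 / (1.0 + np.exp(-(b0 + b1 * x)))) - 0.5

    # Brent's method; assumes f changes sign on [lo, hi] (the score range)
    return brentq(f, lo, hi)

# two illustrative bootstrap fits, both crossing 0.5 near x = 0.3
betas = [(-3.0, 10.0), (-2.8, 9.5)]
x_star = calibration_cutoff(betas)
```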

A.10 Additional experimental results
In this section, we report additional experimental results on precision and average precision.
We report precision for the exploit-explore retrieval strategy across countries in Figure 5 and for the four AL strategies on English tweets in Figure 6.
We also detail the evaluation results for the exploit-explore retrieval strategy across countries in Table 10 and for the four AL strategies on English tweets in Table 11.

Figure 1 :
Figure 1: An example of a tweet suggestive of its author currently being unemployed and actively looking for a job.

Figure 2 :
Figure 2: Average precision, number of predicted positives and diversity of true positives (in row) for each class (in column) for

Figure 3 :
Figure 3: Average precision, number of predicted positives and diversity of true positives (in row) for each class (in column)

Table 3 :
Label distribution on the stratified sample for each country and class

Table 4 :
Class co-occurrence in the US initial stratified sample.It reads as follows: out of all positives for the Is Unemployed class, 32% are positives for Lost Job.

Table 5 :
Average character length and top 10 most frequent tokens for each class in the initial US stratified sample

Table 6 :
Part-of-Speech (POS) tag distribution among positives of each class from the initial US stratified sample.The definition of the acronyms can be found here.

Table 7 :
AUROC results on the evaluation set at iteration 0.
Algorithm 2 (excerpt):
For each retained top-lift k-skip-n-gram, sample 5 tweets in R_s containing this motif;
Label sampled tweets for each class;
Add new sampled tweets to the set of all labels;
Perform a new train-test split on this set and use it to train and evaluate the classifier for the

Table 10 :
Evaluation results using the exploit-explore retrieval active learning method. The results are reported across languages.