Beyond Good Intentions: Reporting the Research Landscape of NLP for Social Good

With the recent advances in natural language processing (NLP), a vast number of applications have emerged across various use cases. Among the plethora of NLP applications, many academic researchers are motivated to do work that has a positive social impact, in line with the recent initiatives of NLP for Social Good (NLP4SG). However, it is not always obvious to researchers how their research efforts tackle today's big social problems. Thus, in this paper, we introduce NLP4SGPAPERS, a scientific dataset with three associated tasks that can help identify NLP4SG papers and characterize the NLP4SG landscape by: (1) identifying the papers that address a social problem, (2) mapping them to the corresponding UN Sustainable Development Goals (SDGs), and (3) identifying the task they solve and the methods they use. Using state-of-the-art NLP models, we address each of these tasks and use them on the entire ACL Anthology, resulting in a visualization workspace that gives researchers a comprehensive overview of the field of NLP4SG.


Introduction
With the rapid advancement of natural language processing (NLP) technology (Radford et al., 2018; Devlin et al., 2019; Brown et al., 2020; Ouyang et al., 2022), there has been a growing interest in using NLP for applications with social impact in recent years (Wu et al., 2020; de Mello et al., 2019; Jin and Mihalcea, 2023). This effort has been condensed into the initiative of "NLP for Social Good" (NLP4SG) (Jin et al., 2021), which began with an overall theoretical framework (Jin et al., 2021) in 2021, and is reaching broader community impact through various workshops at NLP conferences (Field et al., 2021; Biester et al., 2022).

Figure 1: To generate the NLP4SG progress report, we provide a PaperAnalyzer, which identifies social good papers (Task 1), classifies the relevant UN SDGs (Task 2), and analyzes salient scientific terms (Task 3). We process all ACL Anthology papers to create a Sankey diagram (bottom) on our website.
However, most of the efforts so far have consisted of disparate community discussions, opinion pieces, and projects, without a unifying community-based insight. This is partly due to a lack of understanding of how NLP research efforts align with social good implications. We argue that it is critical to have community reports based on solid empirical evidence, similar to the United Nations' (UN) annual compilation of the report on Sustainable Development Goals (SDGs) (e.g., United Nations, 2021, 2022). Such progress reports for NLP4SG can help calibrate intuitions with facts and (1) inspire the NLP community to understand how well NLP research addresses social good; (2) raise awareness of under- or unaddressed areas; (3) serve as a reference for researchers who start in the field when deciding on what topics to take on; and (4) foster better connections between the research community and social science experts, allowing for more impactful research and feedback to improve NLP4SG.
To this end, we design a new suite of tasks for analysing NLP4SG papers, spanning from identifying NLP4SG papers, to classifying them into SDGs, to identifying the tasks and methods used in them. We create NLP4SGPAPERS, a scientific dataset of 5,000 papers annotated with the above three tasks, and use the dataset to build an "NLP for NLP" PaperAnalyzer system (Figure 1) that applies state-of-the-art NLP models to NLP papers.
Using our PaperAnalyzer system, we parse the entire database of 76K ACL Anthology papers, and find the following trends: (1) 13.07% of the papers are related to social good; (2) healthcare and education are popular goals among NLP4SG papers, but goals such as poverty and hunger are largely unaddressed; (3) the most common tasks addressed by the community in NLP4SG are machine translation, text classification, and toxicity detection, and although the most common models are large language models (LLMs) (e.g., Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019), these innovations have not been deployed in all tasks.

Key Questions to Answer. The progress report should answer the following questions:

Q1 Overall estimate. How large is the proportion of NLP4SG papers among all NLP papers? How does this proportion change over the years?

Q2 Social good aspects. Among NLP4SG papers, what specific social good aspects do they tend to address?

Q3 Technical solutions. For each social good goal, how are the papers addressing them? For example, what tasks do people formulate, and what models do people use? Is there an overlooked space?
Properties of a Good Progress Report. After settling on the key questions that the report should aim to answer, we propose several criteria for a good automated NLP4SG progress report.
A good report should (P1) comprehensively address the questions; (P2) base its answers on representative, large data; (P3) require little repetitive manual effort, ideally largely automated; (P4) be transparent and open-sourced so that it is trustworthy; (P5) be reader-friendly for both technical and non-technical audiences; and (P6) be easy to access for all parties, including but not limited to researchers, social good organizations such as NGOs, funding agencies, and the general public.
Missions. This report aims to serve several purposes. First, we want to convey an overall positive and encouraging message that there is a growing and diverse community of researchers who are interested in NLP4SG. Second, we want to improve the cohesion of this community by making information on the existing research landscape easier to access, so that people can build on each other's research. Third, we aim to inspire more people to develop novel tasks and use cases to fill in the under-addressed social impact areas. Finally, we want to make the current progress transparent and public to all interested parties, especially non-research parties, including organizations such as NGOs working on NLP4SG, decision-makers, and the general public. By doing this, we can better connect the research community with all other related communities, not only to broadcast the impact of NLP4SG to the broader community, but also to help improve NLP research with external feedback, and to formulate research questions more aligned with real-world problems.
Our Proposed NLP-for-NLP Pipeline. We develop an NLP-for-NLP pipeline to automatically analyze papers for the NLP4SG report, as shown in Figure 2. As part of this pipeline, we develop a PaperAnalyzer system consisting of various models trained to answer the three questions (Q1-Q3).
The pipeline consists of the following steps: (1) We annotate various datasets that can help us answer the three questions (meeting P1 and P2). (2) We use this data to train and evaluate our PaperAnalyzer models. (3) Next, we run our NLP4SG PaperAnalyzer (which meets P3) on a large repository of NLP papers, and compose a comprehensive NLP4SG progress report, together with an open-source GitHub release of our code (which meets P4) and an interactive interface whose screenshots are in Appendix A (which meets P5 and P6). (4) Finally, we distill some key findings for the NLP community, to pave the way for more cooperative, evidence-guided community efforts to fuel NLP4SG.

NLP4SGPAPERS Dataset
We build the NLP4SGPAPERS dataset to address the following three tasks, as illustrated in Figure 1:
• Task 1: Binary classification of NLP4SG papers (for Q1).
• Task 2: Multi-label multi-class categorization of NLP4SG papers into the 17 UN SDGs (for Q2).
• Task 3: Identification of salient scientific terms (tasks and methods) in NLP4SG papers (for Q3).
Next, we introduce the data collection process, and then analyze the data statistics for each task.

Dataset Collection
Data Source and Preprocessing. We identify the ACL Anthology as an ideal data source. We obtain 76,229 papers from the ACL Anthology that were published by May 27, 2022. We randomly sample 5,000 papers uniformly over the entire ACL Anthology and parse their titles and abstracts. See Appendices B.1 and B.2 for discussions about the data sources and our data preprocessing details. We split the dataset into 2,500 train samples, 500 development samples, and 2,000 test samples. Statistics of the data are shown in Table 1.
Task Formulation. Let us denote our dataset as $\mathcal{D} := \{x_i\}_{i=1}^{N}$, consisting of $N$ papers, where each paper $x := (t, a)$ contains a title $t$ and an abstract $a$.
Task 1 is a binary classification task $f_1: \mathcal{X} \to \mathcal{Y}$, which maps each paper to a binary label space $\mathcal{Y} = \{0, 1\}$ representing whether a paper addresses a social good task ($y = 1$) or not ($y = 0$).
Task 2 is a multi-label multi-class classification task $f_2: \mathcal{X}' \to \mathcal{Z}$, which maps NLP4SG papers $\mathcal{X}' = \{x \mid f_1(x) = 1, x \in \mathcal{D}\}$ to a space $\mathcal{Z}$, which is the power set of all 17 UN SDGs.
In Task 3, we extract the main NLP tasks addressed and methods used in the NLP4SG papers, $f_3: \mathcal{X}' \to \mathcal{W}_t \times \mathcal{W}_m$, where $\mathcal{W}_t$ and $\mathcal{W}_m$ represent the spaces of the power sets of the main tasks and methods in NLP4SG papers, respectively. Given our desiderata, it is better to make the extracted terms easy to use for summarizing trends in our Sankey diagram (Figure 1). Inspired by the annotation scheme of Papers with Code (https://paperswithcode.com/), we use task names such as "machine translation" and "toxicity detection", and method names such as "BERT-based models," "long short-term memory networks (LSTMs)," and "support vector machines (SVMs)." We provide the detailed list in Appendix B.3.4. This allows us to use Task 3 to provide evidence that can help us answer high-level questions such as which NLP4SG work uses which type of technology, as shown in Figure 1.
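To make the formulation concrete, the following minimal Python sketch mirrors the three task interfaces and their sequential composition; all names and signatures here are ours for illustration and do not come from the released code.

```python
# A minimal sketch of the three task interfaces and their composition;
# all names and signatures are ours for illustration, not the released code.
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

@dataclass
class Paper:
    title: str     # t
    abstract: str  # a

def task1_is_nlp4sg(x: Paper) -> bool:
    """f1: maps a paper to {0, 1} (does it address a social good task?)."""
    raise NotImplementedError  # e.g., a fine-tuned SciBERT classifier

def task2_sdgs(x: Paper) -> Set[int]:
    """f2: maps an NLP4SG paper to a subset of the 17 UN SDGs (1..17)."""
    raise NotImplementedError  # e.g., a zero-shot MNLI or InstructGPT model

def task3_terms(x: Paper) -> Tuple[Set[str], Set[str]]:
    """f3: extracts the (tasks, methods) mentioned in an NLP4SG paper."""
    raise NotImplementedError  # e.g., a scientific IE model such as PURE

def paper_analyzer(papers: List[Paper]) -> Iterable[tuple]:
    """Compose the three tasks sequentially, as in PaperAnalyzer."""
    for x in papers:
        if task1_is_nlp4sg(x):                      # Task 1 gates the rest
            yield x, task2_sdgs(x), task3_terms(x)  # Tasks 2 and 3
```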
Data Annotation. The data associated with each task was annotated by two NLP researchers who are proficient in English. We provide detailed annotation guidelines for each task in Appendix B.3. Briefly, the inter-annotator agreements are 92.93% Cohen's kappa for Task 1 and 88.67% Cohen's kappa for Task 2, and a high similarity score of 84.63% BERTScore (Zhang et al., 2020) between the annotators' responses for Task 3. We resolved cases of disagreement by first asking the annotators to review potential oversights, and then clarifying the annotation principles for samples that are ambiguous. With these clarifications, the annotators converged to an agreement through re-iteration and discussions.
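For reference, here is a small sketch of how such agreement numbers can be computed, using scikit-learn's Cohen's kappa and the bert-score package; the labels and term lists below are toy data, not the actual annotations.

```python
# A small sketch of the agreement computations: Cohen's kappa via
# scikit-learn and BERTScore via the bert-score package; toy data only.
from sklearn.metrics import cohen_kappa_score
from bert_score import score as bert_score  # pip install bert-score

ann1 = [1, 0, 0, 1, 1]  # annotator 1, Task 1 binary labels (toy)
ann2 = [1, 0, 1, 1, 1]  # annotator 2
kappa = cohen_kappa_score(ann1, ann2)

# For Task 3, free-text term lists from the two annotators are compared.
cands = ["machine translation; transformers"]
refs = ["machine translation; transformer models"]
precision, recall, f1 = bert_score(cands, refs, lang="en")
print(f"kappa={kappa:.2f}, BERTScore F1={f1.mean().item():.4f}")
```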

Dataset Characteristics
Data Overview for Task 1. We show the overall statistics in Table 1. We can observe that 11.84% of the papers are related to social good (592 of our sample of 5,000 papers).

Data Overview for Task 2. For Task 2, we first show in Table 2 the distribution of all 17 UN SDGs among the 592 social good-related papers. The distribution is strongly skewed, with some goals covering the majority of the data. For example, the top three goals, G3 Health (34.12%), G16 Peace (31.93%), and G4 Education (16.22%), cover a total of 82.27% of the NLP4SG papers. Many goals rarely occur or are not present in the dataset at all, such as G1 Poverty and G2 Hunger. This distribution suggests either a lack of attention from the NLP4SG community to some of these UN goals or challenges in addressing these goals through NLP. Additional analyses are presented in Section 5.
To enrich the data for low- or zero-occurrence classes, we perform upsampling to increase the number of samples for all the classes that constitute less than 5% of our dataset. Specifically, we take the unannotated part of the ACL Anthology, run the best model for Task 1 to identify social good papers, and use SDG-specific keywords to obtain likely paper candidates for the under-addressed SDGs. Then, we manually check the SDGs that these papers address. Through this procedure, we enrich the original data with 167 more papers, which are distributed over the low- or zero-occurrence classes. Details of the upsampling procedure are in Appendix B.3.3.
We report the statistics of the upsampled dataset in Table 7.

Data Overview for Task 3. Since there has been significant previous work (Jain et al., 2020; Luan et al., 2018) on extracting salient information from scientific literature, we only annotate test data for this task to evaluate existing models in our domain of NLP4SG. On average, the spans of task terms are 3.6 words long, and the spans of method terms are 2.4 words long. Also, each paper has an average of 1.06 tasks and 1.5 methods, and the entire set has a total of 263 unique tasks and 335 unique methods.

NLP4SG PaperAnalyzer
Our PaperAnalyzer system works as shown in Figure 1, with three sequential steps: we first train a binary classifier (addressing Task 1), then a UN SDG goal classifier (addressing Task 2), and finally process the paper with a pretrained task and method extractor (addressing Task 3).
We use distant supervision (DS) for data augmentation when finetuning the models. To obtain distantly supervised data, we use two methods to identify additional NLP4SG papers from the unannotated set of almost 76K papers. We first apply keyword matching using a list of curated NLP4SG keywords such as "healthcare" and "education", and then we also include papers with high cosine similarity to the textual descriptions of the UN SDGs. See the full list of keywords and details of how we compile the distantly supervised data in Appendix D.

Table 3: Model performance on Task 1, including the random baselines, whose distributions are proportional to the label distribution or uniform, as well as common LLMs, with and without our proposed DS technique. We report the F1 score, precision (P), and recall (R) of the positive class, as well as accuracy (Acc). For the best-performing model, SciBERT+DS, we also conduct ablation studies removing the different subsets of our DS data.
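To illustrate the first heuristic, here is a minimal sketch of title-based keyword matching; the two keywords are just the examples named above, not the full curated list, and the sample titles are invented.

```python
# A minimal sketch (our illustration, not the released code) of the
# title-based keyword heuristic; the real curated keyword list is larger.
NLP4SG_KEYWORDS = {"healthcare", "education"}  # examples named in the text

def keyword_match(title: str) -> bool:
    """Flag a paper as a distantly supervised positive if its title
    contains any curated NLP4SG keyword."""
    title_lower = title.lower()
    return any(kw in title_lower for kw in NLP4SG_KEYWORDS)

titles = [
    "Neural Machine Translation for Healthcare Dialogues",
    "A Treebank for Parsing Historical Texts",
]
ds_positives = [t for t in titles if keyword_match(t)]
print(ds_positives)  # ['Neural Machine Translation for Healthcare Dialogues']
```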
Results. We show our model performance for the binary classification task in Table 3. The best-performing model is SciBERT+DS, which is enhanced by the addition of our DS technique. DS contributes a 2.56 percentage point (pp) improvement over the baseline SciBERT model trained on fully supervised data only. Moreover, with the ablation study, we show that using only keyword-based DS yields an improvement of 1.26 pp in F1 score. Using only similarity-based DS does not improve the F1 score, but increases recall by 7.43 pp. Both DS techniques combined help the model increase the F1 score to 75.98%. We choose the model with the best F1 score because we want a precise but also complete report of the NLP4SG landscape.
Interpretability & Error Analysis. To understand the workings of our model, we apply a commonly used interpretability tool, local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016), to our best-performing model, SciBERT+DS. LIME outputs the words that our model relies on as positive and negative evidence for the classification. Some common positive keywords for our model include "biomedical", "students", and "social", while some negative keywords are "lexical", "parsing", and "grammar". The details of our LIME implementation and more visualized examples can be found in Appendix E.1.
Our model makes a few common types of errors: (1) Among false positives (FP), many papers include the common word "social", which appears in 10.10% of the FP papers. We suspect that this is due to the frequent use of "social media", which is not necessarily related to social good, and the fact that "social" is otherwise a good indicator of an NLP4SG paper, since it appears in 15.03% of the true positive papers. (2) Among false negative (FN) samples, a large subset is innovation- and education-related papers, constituting 30.30% and 27.27% of the FN respectively, larger than their proportions (13.55% and 23.90%) in the entire test set. A reason might be that the frequent use of words like "learning" and scientific vocabulary makes our model struggle to distinguish a research paper that uses machine learning from one addressing education topics.

UN SDG Goal Classifier
Experimental Setup. For Task 2, we evaluate zero-shot models using all the 759 annotated NLP4SG papers as our test set, as it is challenging to train classifiers for this 17-class problem given the large class imbalance, which results in several low-occurrence classes. Specifically, we first adopt pretrained MNLI classifiers, including BART (Lewis et al., 2020), DistilBERT (Sanh et al., 2019), and DeBERTa (He et al., 2021). We also include the InstructGPT model (Ouyang et al., 2022), which we ask to generate a list of relevant SDGs for the paper. More experimental setup details are in Appendix D.2, including the prompts, implementation details for the models, and the evaluation metrics.
Results & Error Analysis. We evaluate all models and report their results in Table 4. Among all the models, InstructGPT has the highest performance, scoring 69.37% F1. It also has the highest performance on all other measures, such as partial list match (74.57% of the samples) and exact list match (66.66% of the samples). We also perform a more detailed breakdown of the performance of our best-performing model by different UN SDGs in Table 5. We can see that it is easy for the model to distinguish goals such as health, gender equality, and climate. However, other goals such as social inequalities and innovation are ambiguous for the model. One possible reason is that some NLP papers use domain-specific words such as "bias," which can also mean data bias or spurious correlations in the data, so the model might confuse it with the general meaning of bias towards or against certain groups of people in society.

Task and Method Analyzer
Experimental Setup. Although our goal is to extract the main tasks and methods for our Sankey diagram, our setting is novel and distinct from traditional sentence-level named entity recognition (NER), which aims at span-level extraction of all named entities. Instead, our setting is document-level, as we just need one mention of the same term in case there are co-references or paraphrases. We also accept generative answers if there is no appropriate existing span. Thus, we evaluate Task 3 using common metrics for extractive and generative question answering (QA) (Rajpurkar et al., 2016; Chen et al., 2019; Sai et al., 2023), such as exact match, F1, and BERTScore (Zhang et al., 2020).
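To make the evaluation concrete, here is a small sketch of the exact match and token-level F1 metrics in the common SQuAD style; this is our illustration, not the paper's released evaluation code.

```python
# A sketch of the SQuAD-style metrics named above (exact match and
# token-level F1); our illustration, not the released evaluation code.
from collections import Counter

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)  # token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("toxicity detection", "Toxicity Detection"))                # True
print(round(token_f1("toxicity detection", "online toxicity detection"), 2))  # 0.8
```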

Insights
Using our NLP4SG PaperAnalyzer system, we analyze all the 76K NLP papers from the ACL Anthology to answer our driving questions Q1, Q2, and Q3. Additionally, we build a demo website at nlp4sg.vercel.app to make our analysis results more accessible to all parties interested in NLP4SG. Our website starts with a homepage shown in Figure 4, whose visualization plots we will refer to in the related analysis in this section. For a full overview of the various demos on the website, see Appendix A.

Q1. How Many NLP4SG Papers Are There?
We apply our NLP4SG binary classifier to the entire ACL Anthology with 76K papers. Overall, around 13.07% are social good-related papers. For a more detailed overview, we plot the proportion of social good-related papers among all NLP papers every year from 1980 in Figure 5. We can see a clear rising trend of NLP4SG in recent years. Notably, the proportion of NLP4SG papers increased to 19.0% in 2020 from the initial percentage of 8.2% in 1980. It is important to note that while these percentages provide us with directional insight into the trend, their accuracy is contingent on the precision of the model used for analysis.

Figure 6 compares three views of the SDGs. The left panel shows the importance scores of the SDGs from a survey of sustainability researchers (Yang et al., 2020). The middle panel is compiled from our survey on "How current NLP researchers think about NLP4SG" of 80 NLP researchers at ACL 2021, where we ask the subjects "What social good applications do you think the NLP community should consider?" and then map the answers (e.g., NLP for education, and combating misinformation) to the corresponding SDGs. The survey details are in Appendix F.1. Finally, we apply our UN SDG goal classifier to all the ACL Anthology papers, and visualize the distribution of the 17 goals in the right panel of Figure 6. Our demo website (https://nlp4sg.vercel.app/sdg) contains a version of this tri-view comparison, shown in Appendix A.1, where the user can specify whether they want to look at the overall trend, or the trend in any specific year.

Q2. What Aspects of Social Good Are Addressed?

From the overall comparison in Figure 6, we can see a strong mismatch between the importance of the goals (left) and the actual size of research efforts (right). For example, NLP4SG papers mostly address health, peace, and education, which are admittedly important goals. However, they hardly address other important goals such as poverty and climate, which are pressing social issues. This might be because these topics are more challenging to address, e.g., due to a lack of available datasets and expertise, trends dominating researchers' attention, the funding climate (e.g., healthcare research being better funded), ease of publishing papers, the popularity of a field, which may help gain citations more easily, the existence of good benchmark datasets, and so on. Note that we intentionally avoid technological solutionism; a more accurate framing is trying to make the best use of the space where NLP can help (Appendix F.2), and the first two bar charts are references to help gain better insights into this.

Q3. What Technical Solutions Are Used?

On our Sankey diagram demo page (https://nlp4sg.vercel.app/sankey), we also enable clicks into each goal, task, or method to see the corresponding list of papers. We show the titles and abstracts of the papers and highlight text spans that contain the tasks and methods, as illustrated in Figure 7. This function helps facilitate NLP4SG literature reviews and saves time for future researchers.

Related Work

AI for Social Good. Recent years have seen a rising effort to apply AI technologies to various aspects of social good (e.g., Chen et al., 2020; Yu et al., 2018). To enhance the community that cares about AI for social good (Tomašev et al., 2020; Jin et al., 2021; Hager et al., 2019), it is crucial to gain an overview of its progress. However, existing efforts mostly rely on time-consuming manual compilation, from surveying hundreds of papers (Cowls et al., 2021; Jin et al., 2021) and conducting limited case studies (Floridi et al., 2020; Hager et al., 2019), to pairing existing papers on tasks and datasets with SDGs (Yeh et al., 2021). Some work uses keyword matching to extract social good-related papers (Shi et al., 2020; Fortuna et al., 2021), but our work is the first to build a comprehensive paper analysis system, and we are the only work designing a suite of various tasks, annotating a dataset, and using state-of-the-art NLP models to support such analysis.

NLP on Scientific Literature. There is an active research area on building NLP models for analysing scientific text. Example research directions include dataset creation (Ammar et al., 2018; Lo et al., 2020; Wang et al., 2020) and modeling for tasks such as information extraction (Jain et al., 2020; Luan et al., 2018; Zhong and Chen, 2021; Hope et al.).

Conclusion

In this work, we make possible the reporting of the state of NLP4SG, with the help of several novel tasks specific to NLP4SG papers, the construction of a richly annotated NLP4SGPAPERS dataset, and the development of a PaperAnalyzer system based on state-of-the-art LLMs. We use this system to provide researchers with insights into the landscape of NLP4SG research, paving the way for a more evidence-driven pursuit of NLP4SG that goes beyond good intentions. To make this information more accessible, we also create a website to visualize our analysis.

Limitations
One limitation of this work is that the models are not yet perfect, so the insights on the entire ACL Anthology are bottlenecked by the best model performance. To make up for this, we also provide in Appendix G.2 an additional analysis with only the gold, annotated data, which is accurate but smaller and less representative of the entire landscape of NLP4SG. Since our system is modular, and we can plug in better models for each task, we welcome future work to further improve the current models. One example improvement direction is to model more inputs than the title and abstract, for instance by taking into consideration the full text and meta information such as the venue and publication year of the paper.
Our sampling technique is designed to ensure the representativeness of our sample within the larger population of academic papers in our dataset, in line with our primary objective to reflect the distribution of each SDG. We acknowledge that there may be variations in the distribution of papers over the years, particularly in earlier years with a limited number of papers. In future work, a more detailed analysis could be conducted to track the distribution of NLP4SG papers over time, providing a more nuanced understanding of how these papers have evolved and complementing our broader findings regarding SDG representation.
Another limitation is that the definition of social good might experience a gradual shift as society evolves its ethical standards. There is no fixed, definitive answer, so we welcome community discussions. This paper is a snapshot of existing thinking frameworks (Cowls et al., 2021; Jin et al., 2021) and our understanding as of 2023. Updating the definition of social good is key to mitigating the risk of overlooking important societal problems. In the long run, we suggest community-wide discussions and iterations to update the definition of social good over time.
Finally, beyond the three tasks we propose for NLP4SG papers, there could potentially be other tasks, such as generating task formulations for under-addressed goals, suggesting new task-method combinations, and so on. For example, some NLP tasks that could address issues related to poverty include sentiment analysis to understand public perceptions of poverty and related issues, information extraction from government documents to identify and track funding for poverty reduction programs, and machine translation to make information about poverty reduction programs and resources available to non-English speakers. Moreover, our work is currently limited to the knowledge within NLP, so it would be better if future work could collect data about the downstream real-world impact of each research trace, which can provide better feedback for the community.

A Overview of the Demo Website
We create a demo website at https://nlp4sg.vercel.app/. Our website starts with a homepage shown in Figure 4, which provides access to all the analysis visualization plots, made in JavaScript. We introduce each visualization function in detail below.

A.1 Visualization 1: SDG Alignment
We first provide a visualization in Figure 8 of the SDGs (Section 5) from three aspects: the importance scores of SDGs collected in a survey of sustainability researchers (Yang et al., 2020), NLP researchers' opinions, and the number of ACL Anthology papers addressing the goals. In addition to the static plot in Figure 6, we enable a scroll bar at the top to select the view of papers in a given year at https://nlp4sg.vercel.app/sdg.

A.3 Visualization 3: Paper Browser
Corresponding to the Sankey diagram, we also enable clicks into each goal, task, or method to see the corresponding list of papers, an example of which is shown in Figure 7 (https://nlp4sg.vercel.app/sankey). We show the titles and abstracts of the papers and highlight text spans that contain the tasks and methods. This function helps facilitate NLP4SG literature reviews and saves time for future researchers.

A.4 Visualization 4: Supporting Organizations
To understand the support structure behind the pursuit of the SDGs, we collect the organizations supporting each research paper. We run named entity recognition models from the Stanza Python package (Qi et al., 2020) over the Acknowledgments section. If no organizations can be identified, we use hand-crafted rules, such as taking the main organization name of the authors if it is not a research institute, as these organizations could be self-supported. We visualize the distribution of organizations in the form of a word cloud in Figure 9 and a Sankey diagram corresponding to each SDG in Figure 10.
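For reference, a minimal sketch of this extraction step with Stanza could look as follows; the acknowledgment sentence is invented for illustration.

```python
# Sketch of the organization extraction step using Stanza's NER, as
# described above; the acknowledgment string is made up for illustration.
import stanza

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

ack = "This work was supported by the National Science Foundation."
doc = nlp(ack)
orgs = [ent.text for ent in doc.ents if ent.type == "ORG"]
print(orgs)  # e.g., ['the National Science Foundation']
```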

A.5 Visualization 5: Paper Cluster Visualization
We also make a bottom-up plot of all the papers at a glance at https://nlp4sg.vercel.app/papers. Specifically, we use the text embeddings of the papers from a sentence transformer (Reimers and Gurevych, 2019), project them using t-SNE (Rauber et al., 2016), and cluster them with the HDBSCAN clustering algorithm (Campello et al., 2013). A screenshot of the paper clustering visualization page is in Figure 11.
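A compact sketch of this embed-project-cluster pipeline is below; the paper titles and the specific sentence-transformer checkpoint are our illustrative assumptions, not necessarily the exact ones used for the website.

```python
# A sketch of the embed-project-cluster pipeline; the titles and the
# specific sentence-transformer checkpoint are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import hdbscan  # pip install hdbscan

titles = ["Neural machine translation for low-resource languages",
          "Detecting hate speech on social media",
          "A BERT-based model for clinical note classification"] * 10

embeddings = SentenceTransformer("all-distilroberta-v1").encode(titles)
coords = TSNE(n_components=2, perplexity=5.0).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)
print(labels[:10])  # cluster id per paper; -1 marks noise points
```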

B Data Collection Details B.1 Data Source
We identify the ACL Anthology as a quality source of data, since it contains a large set of peer-reviewed NLP papers. There are other potential sources of peer-reviewed NLP papers, such as papers submitted to the NLP tracks of other venues such as NeurIPS and ICLR, as well as non-peer-reviewed papers on arXiv under the category cs.CL. However, for the former, it is hard to algorithmically identify NLP papers with high precision and recall, not to mention the different copyright regulations; for the latter, it is hard to control the paper quality if we were open to non-peer-reviewed data.
After balancing all the concerns and also thinking of our target audience, the NLP community, we think ACL Anthology, which is peer-reviewed and maintained by the NLP community, is a good data source for our work.

B.2 Preprocessing
We conduct the following preprocessing on the ACL Anthology data. To identify the paper entries in the entire anthology, we remove those that are workshop proceedings, invited talks, and book reviews, using a set of hand-crafted rules, such as identifying titles starting with "Book Reviews: ..." or ending with "... Workshop." To clean the titles, we remove the special symbols "{" and "}" in titles such as "{N}o{P}ropaganda at {S}em{E}val-2020 Task 11: A Borrowed Approach to Sequence Tagging and Text Classification." Moreover, not all entries in the anthology provide a non-empty abstract, so for papers with missing abstracts, we parse their PDFs using the doc2json Python package, a PDF-to-JSON converter, to extract the abstracts.
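A small sketch of what such hand-crafted rules could look like follows; the exact rule set is the authors', so these patterns are illustrative.

```python
# Illustrative patterns for the hand-crafted filtering and cleaning rules;
# the authors' actual rule set may differ.
import re

def keep_entry(title: str) -> bool:
    """Drop non-paper anthology entries such as book reviews and
    workshop proceedings."""
    return not (title.startswith("Book Reviews:") or title.endswith("Workshop."))

def clean_title(title: str) -> str:
    """Strip BibTeX-style case-protection braces from titles."""
    return re.sub(r"[{}]", "", title)

print(clean_title("{N}o{P}ropaganda at {S}em{E}val-2020 Task 11"))
# NoPropaganda at SemEval-2020 Task 11
```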

B.3 Annotation Scheme
We introduce the annotation scheme for the three tasks below.

B.3.1 Annotating Task 1: NLP4SG Classification
Inspired by previous frameworks defining NLP4SG (Jin et al., 2021) and AI for social good (Cowls et al., 2021), we design a decision flowchart, shown in Figure 12, to identify whether an NLP paper addresses a social good-related problem. Specifically, there are two important decision criteria. One is the relevance to social good topics, e.g., the 17 UN SDGs. We also map social problems in the digital era, such as online toxicity, fake news, and privacy, to their corresponding SDG, i.e., (Goal 16) Peace, justice and strong institutions. The second criterion is how directly the paper addresses the social problem. For example, general linguistic tools such as syntax parsing could be an important technology, but do not carry direct social implications. We adopt the categorization of the stage of a technology in Jin et al. (2021), and only count those with a clear implication for downstream applications (i.e., Stages 3 and 4).
To convey the idea to the annotators, we provide both the formal guideline above and concrete examples as follows.

What is included:

What is excluded:

(Example set 1) General-purpose, coarse-grained NLP tasks: machine translation, language modeling, summarization, sentiment analysis, etc.

(Example set 2) General-purpose, fine-grained NLP tasks: news classification; humor detection; technologies to increase productivity, e.g., email classification, report generation, meeting note compilation (because they are application-agnostic, could be used for both good and bad purposes, and are also a bit too general); textbook-related QA when used as a benchmark to improve general modeling capabilities; tasks whose data is socially relevant but whose task is neutral (e.g., POS tagging for parliament speech); NLP to help other neutral disciplines, e.g., chemistry; tasks a bit too indirectly related to SDGs, e.g., parsing historical language documents or cultural heritage-related tasks, and low-resource MT, which bridges resources from one community to another, but is a bit too indirect, depends case by case on the actual language community, and involves a tradeoff between efficiency and equality.

(Example set 3) Tasks with a controversial nature or unknown effects (varying a lot by how people use them in the future): e.g., news comment generation; financial NLP, which could be used either to help the economy or to perturb the market for private profit; simulated NLP tools for the battlefield; user-level demographic prediction.
As mentioned in the Limitations and Ethical Implications sections, the definition of social good might experience a gradual shift when society evolves its ethical standards.There is no fixed, definitive answer, so we start with relatively inclusive criteria here, and welcome community discussions and iterations to improve the definitions.This paper is a snapshot of existing thinking frameworks (Cowls et al., 2021;Jin et al., 2021) and our understanding by 2023.In the long run, we suggest community-wide discussions and iterations to update the definition of social good from time to time.

B.3.2 Annotating Task 2: SDG Classification
We let the annotators read the official descriptions and indicators of the 17 SDGs on the UN website, such as https://sdgs.un.org/goals/goal1. For each paper identified as NLP4SG in Task 1, we ask the annotators to label all relevant SDGs.
We present examples of NLP4SG paper titles corresponding to each SDG below.

G2. Hunger
• A Gold Standard for CLIR evaluation in the Organic Agriculture Domain
• CRITTER: a translation system for agricultural market reports

G3. Health
• A Treebank for the Healthcare Domain

B.3.3 Upsampling for Task 2

To enrich the data for Task 2, we upsample the low-occurrence classes. We denote all the classes with less than a 5% proportion in our dataset as the low-occurrence classes. We first automatically extract candidate papers that might correspond to the low-occurrence classes, and then manually annotate their SDG classes. To automatically identify a candidate pool of papers regarding the low-occurrence classes, we first run the best model of Task 1 on the entire ACL Anthology to extract 9K NLP4SG papers, and among them, we use our curated keywords corresponding to the SDGs to extract candidate papers.
In Table 7, we can see that the automatic matching algorithm yields 261 candidate papers in total, including 1 candidate paper for poverty, 25 for hunger, 67 for gender, ..., and 40 for partnership. After manually annotating these 261 papers, we obtain 4 confirmed papers for hunger, 67 for gender, 1 for water, 44 for economy, ..., and 5 for partnership. We use this upsampled data as additional data for Task 2, resulting in 759 total NLP4SG papers for Task 2.

B.3.4 Annotating Task 3: Salient Scientific Term Analysis
We present the title and abstract of social good papers to the annotator and ask them to write down a list of the primary tasks and methods addressed by each paper.In case there are hyponymy and hypernymy relationships among the scientific terms, we take the more specific term that is still common enough to show up in the Sankey diagram.
For example, we would use the term LSTMs, but not neural networks (too broad) or stacked BiLSTMs with co-attention (too specific to be frequent enough to show up on the Sankey diagram). We enable the annotators to choose the relevant task and method spans from the title and abstract, along with an extra text box as an option in case an exact span is not available.
Below are the list of some common tasks and methods.
List of common tasks: automatic speech recognition, COVID-19-related analysis, text classification, data collection, event extraction, fact checking, fake news detection, gender bias mitigation, information extraction, information retrieval, linguistic analysis, machine translation, NLP applications, named entity recognition, natural language generation, question answering, relation extraction, rumor detection, sentiment analysis, stance detection, text summarization, toxicity detection, etc.
List of common methods: BERT, classifiers, computational models, conditional random fields, convolutional neural networks, ensemble methods, LSTM, multi-task learning, RoBERTa, supervised learning, support vector machines, topic models, transfer learning, word embeddings, etc. In the absence of more specific methods, it is also acceptable to annotate more coarse-grained methods if they are the only ones mentioned in the paper: language models, language technology, machine learning methods, machine translation systems, NLP models, neural networks, recurrent neural networks, transformers, etc.
Title: Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections

Abstract: Detecting healthcare-associated infections poses a major challenge in healthcare. Using natural language processing and machine learning applied on electronic patient records is one approach that has been shown to work. However the results indicate that there was room for improvement and therefore we have applied deep learning methods. Specifically we implemented a network of stacked sparse autoencoders and a network of stacked restricted Boltzmann machines. Our best results were obtained using the stacked restricted Boltzmann machines with a precision of 0.79 and a recall of 0.88.

B.4 Annotator Background
The data associated with each task was annotated by two NLP researchers who are proficient in English. The annotators include both female and male researchers, from both white and Asian demographic backgrounds.

B.5 Inter-Annotator Agreement
For Task 2, we compute the Cohen's kappa score of the binary classification of each SDG, as shown in Table 9, and calculate the weighted average as 88.67%. Moreover, the partial match score is 96.46% and the exact match score is 85.35%. In Figure 13, we show the co-occurrence matrix with the absolute number of co-occurrences.

D.1.1 Distant Supervision Details

For the keyword matching, we curate a list of social good-related keywords, provided in our GitHub repository, which extends the AI for social good keywords by Shi et al. (2020). We compare the performance of keyword matching based on the title vs. the title and abstract in Table 10, and find that keyword matching on titles gives higher performance, so we adopt title keyword matching as the first DS method.

For the second DS method, cosine similarity matching, we calculate the cosine similarity between the text embeddings of the title + abstract of the paper and the text embeddings of the description of each of the UN SDGs, and take the most similar UN SDG. We add a paper to the positive set if its text similarity is among the top 1% of the most similar papers in the unlabeled data from the ACL Anthology. We add a paper to the negative set if its similarity is in the bottom 50th percentile. For the text embedding similarity, we use a general-purpose sentence similarity model, all-distilroberta-v1 (https://huggingface.co/sentence-transformers/all-distilroberta-v1), which is a pretrained distilroberta-base (Sanh et al., 2019) model finetuned on a dataset of 1B sentence pairs.
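Putting these thresholds together, a sketch of the similarity-based selection could look as follows; the paper and SDG texts are toy examples, and the real pipeline scores the whole unlabeled Anthology.

```python
# A sketch (not the released code) of the similarity-based DS heuristic:
# papers are scored by their best cosine similarity to any SDG description,
# then thresholded at the top-1% / bottom-50% percentiles.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-distilroberta-v1")

papers = ["Detecting depression in social media posts. We propose ...",
          "A new treebank for dependency parsing. We annotate ...",
          "Machine translation for medical leaflets. We build ..."]
sdg_descriptions = ["Ensure healthy lives and promote well-being for all.",
                    "Ensure inclusive and equitable quality education."]

paper_emb = model.encode(papers, convert_to_tensor=True)
sdg_emb = model.encode(sdg_descriptions, convert_to_tensor=True)
# Score each paper by its most similar SDG description.
scores = util.cos_sim(paper_emb, sdg_emb).max(dim=1).values.cpu().numpy()

pos_cut = np.percentile(scores, 99)  # top 1% -> DS positive set
neg_cut = np.percentile(scores, 50)  # bottom 50% -> DS negative set
positives = [p for p, s in zip(papers, scores) if s >= pos_cut]
negatives = [p for p, s in zip(papers, scores) if s <= neg_cut]
```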
Merging both DS sets, the entire DS set consists of positive samples from papers whose titles match the keywords and papers that are very similar to the UN SDG descriptions, and negative samples from papers that are very different from the UN SDG descriptions.
We report the quality of the data extracted by these DS methods, evaluated on the training set, in Table 10, which correlates with the ablation study results in Table 11. We can see that the performance of the entire DS set is higher than that of the keyword-only set, which in turn is higher than that of the similarity-only set: keyword + similarity gets the best quality, and similarity-only gets a very low quality.

D.1.2 Models Implementation Details
BERT, RoBERTa, and SciBERT. We finetune SciBERT (Beltagy et al., 2019) on the NLP4SG text classification task. We freeze the embedding layer and the first 11 encoder layers, and we finetune the parameters of the 12th encoder layer and the classifier layer.
We use one NVIDIA Quadro RTX 6000 GPU with 24 GiB of memory. We train the classifier for 15 epochs using a batch size of 32. The training time for our best-performing model is approximately 1.5 hours. We manually grid search over different learning rates (5e-03, 5e-04, 5e-05, 5e-06) and different scheduler types ("constant", "linear", "cosine with restarts"), for a total of 12 runs, and choose the best hyperparameters based on the F1 score on the dev set. The chosen configuration is a learning rate of 5e-05 and a linear scheduler. All the runs together take a total of around 18 GPU hours. We tried a similar setting using BERT-large and RoBERTa-large, obtaining lower performance.
InstructGPT. For the second approach, we design a prompt and ask InstructGPT in a zero-shot setting whether a paper is NLP4SG or not. The prompt contains 3 elements:
• Introduction to the task: We use the sentence "There is an NLP paper with the title and abstract:"
• Content of the paper: The title and abstract, concatenated
• Final question: We use the sentence "Is this paper contributing to the UN Sustainable Development Goals? Answer yes or no."
We use the OpenAI API to access GPT, with the model "text-davinci-002". We keep the default values of the API, except that we set the temperature to zero and the maximum number of response tokens to 50.
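For concreteness, here is a sketch of this zero-shot call using OpenAI's legacy Completions API (as served at the time); the helper name is ours, and the API key is assumed to be configured.

```python
# A sketch of the zero-shot classification call through OpenAI's legacy
# Completions API (openai<1.0 style); the helper name is ours, and
# openai.api_key is assumed to be set in the environment.
import openai

def is_nlp4sg(title: str, abstract: str) -> bool:
    prompt = (
        "There is an NLP paper with the title and abstract:\n"
        f"{title}\n{abstract}\n"
        "Is this paper contributing to the UN Sustainable Development Goals? "
        "Answer yes or no."
    )
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        temperature=0,  # deterministic decoding
        max_tokens=50,  # short yes/no answer
    )
    return response["choices"][0]["text"].strip().lower().startswith("yes")
```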
Few-shot InstructGPT. We also perform experiments using few-shot settings, i.e., 2, 4, 8, 16, and 32 shots, and report the results for the setting with the best performance on the dev set. For those experiments, we only use the title of the paper and the same proportion of positive and negative examples.
D.2 Models for Task 2

Model 1: InstructGPT. We ask InstructGPT to generate a list of relevant SDGs for each paper, and parse the response, allowing multiple goals for each paper.
Model 2: Pretrained MNLI classifiers. We use a pretrained MNLI classifier (bart-large-mnli) as a zero-shot text classifier, an approach proposed by Yin et al. (2019). We get the probability of entailment/contradiction of the content of the paper with the description of each of the UN goals, obtained from the UN website. We consider that a paper contributes to a UN SDG if the probability of entailment is 0.5 or higher. If there is no SDG with a probability higher than 0.5, we take the SDG with the highest probability. We also test DistilBERT (typeform/distilbert-base-uncased-mnli) and DeBERTa (MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli).
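A minimal sketch of this zero-shot procedure with the Hugging Face pipeline is shown below, thresholding at 0.5 with an argmax fallback; the SDG label texts are abbreviated for illustration.

```python
# A minimal sketch of the zero-shot SDG classifier with thresholding at 0.5
# and an argmax fallback; the SDG label texts are abbreviated here.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

sdg_labels = ["good health and well-being", "quality education"]  # 2 of 17
text = "We build a dialogue system that tutors students in reading."

result = classifier(text, candidate_labels=sdg_labels, multi_label=True)
predicted = [lab for lab, s in zip(result["labels"], result["scores"])
             if s >= 0.5]
if not predicted:
    predicted = [result["labels"][0]]  # labels come sorted by score
print(predicted)
```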

D.3 Models for Task 3
PURE and SciREX. PURE (Zhong and Chen, 2021) is a model for end-to-end relation extraction. We use PURE trained on the SciERC (Luan et al., 2018) dataset of scientific abstracts.

SciREX (Jain et al., 2020) is a document-level IE model that not only identifies mentions, but also models their saliency and their coreference links.

We run SciREX and PURE on an NVIDIA Quadro RTX 6000 GPU with 24 GiB of memory. It takes around one hour to extract the information from the 9K social good papers.
InstructGPT. We use InstructGPT in a zero-shot setting to extract the relevant information. We use a different prompt for each of the two extraction tasks (tasks and methods). The prompt consists of 3 elements:
• Instruction: Here we tell InstructGPT what to do in natural language form.
• Paper content: We concatenate the title and abstract of the paper.
• Final question: We present the question again to reinforce the explanation.

For the samples correctly classified by the best InstructGPT method, we perform further analysis to check the explanation quality of the model. Specifically, we denote a good "impact chain" as a complete and cohesive reasoning chain from the direct description of the paper's research output to its intermediate impact and end impact. Then we randomly select 200 test samples, and use the following two criteria to check the explanation quality of InstructGPT: (1) whether the mentioned "impact chain(s)" sound cohesive and reasonable, e.g., not missing necessary intermediate reasoning steps, which is analogous to a precision score; and (2) whether there are other obvious impact chains that are missing, which is analogous to a recall score. For each criterion, we assign 1 for good performance, 0 for bad performance, and 0.5 for partially good performance.

We show two motivating examples in Table 12, where the first example shows sufficient intermediate reasoning steps, starting from what the paper contributes, to the intermediate effects of the paper, and finally to the end social impact, while the second example lacks intermediate reasoning steps, mentioning only what the paper does, followed by the UN SDG right after, without intermediate explanations.
We find that 77.6% of the explanations contain cohesive and convincing reasoning chains, and 18% of the explanations miss at least one obvious impact chain of how the paper contributes to UN SDGs.

SDG classification explanation
There is an NLP paper with the title and abstract: [Title + Abstract] Is this paper contributing to the UN Sustainable Development Goals? Answer yes or no. If the answer is "yes", mention which goal the paper is contributing to and in which way it is contributing to that goal.

[InstructGPT response]

F.1 Survey Details

We conduct a survey from July 31, 2021 to August 31, 2021 about "How current NLP researchers think about NLP4SG." We receive 80 responses from NLP researchers, among which more than 30% are NLP professors. We reach the subjects at the ACL conference and also through emails. The demographics of the survey subjects are shown in Figure 18.
An Example with Sufficient Reasoning Steps for the Impact Chain

Title: Extracting Symptoms and their Status from Clinical Conversations
Abstract: This paper describes novel models tailored for a new application, that of extracting the symptoms mentioned in clinical conversations along with their status. [...] We propose two novel deep learning approaches to infer the symptom names and their status: [...] This task stems from a realistic application of assisting medical providers in capturing symptoms mentioned by patients from their clinical conversations. [...]
InstructGPT Response: Yes, this paper is contributing to the UN Sustainable Development Goal of Good Health and Well-Being. The paper describes models that are designed to extract symptoms from clinical conversations, which can assist medical providers in capturing symptoms mentioned by patients. This can ultimately lead to better diagnosis and treatment of diseases, and improve overall health and well-being.

An Example Lacking Intermediate Reasoning Steps for the Impact Chain

Title: Lexically-Triggered Hidden Markov Models for Clinical Document Coding
Abstract: The automatic coding of clinical documents is an important task for today's healthcare providers. Though it can be viewed as multi-label document classification, the coding problem has the interesting property that most code assignments can be supported by a single phrase found in the input document. We propose a Lexically-Triggered Hidden Markov Model (LT-HMM) that leverages these phrases to improve coding accuracy. [...]
InstructGPT Response: Yes, this paper is contributing to the UN Sustainable Development Goal of good health and well-being. The paper is doing this by proposing a Lexically-Triggered Hidden Markov Model (LT-HMM) that leverages phrases to improve coding accuracy for clinical documents. This will ultimately help healthcare providers provide better care to patients.
Table 12: Examples of sufficient and insufficient reasoning about the impact chain of NLP4SG papers.
The question in the survey that contributes to the SDG visualization is "What social good applications do you think the NLP community should consider?", for which we manually map the open-ended answers to the corresponding SDGs. Example answers are NLP for education (22.9%), physical and mental health (18.6%), and combating misinformation (17.1%).

F.2 Ideal SDG alignment we want to push for
We show an ideal framework in Figure 19, where, for each goal, NLP can help with a subset of the relevant problems. The meaningful action space is the gap between the current NLP research landscape and what NLP can potentially do.

G More Website Details G.1 Implementation Details
Although we run the PaperAnalyzer on the entire ACL Anthology, we mostly visualize the years 1980 to 2021, because there were not many papers before 1980, and the progress of NLP4SG in recent decades is likely more relevant for the audience. Also, since our data was crawled in the middle of 2022, the trend for 2022 is not comparable to that of the previous year, 2021, so we omit 2022 for the moment; it could be added in the future.
For the visualization of the Sankey diagram, we conduct some preprocessing over the salient scientific terms produced by the model. Since there are variations in the expression of the same task or method, such as "long short-term memory networks" and "LSTMs", we group similar terms together.
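A tiny sketch of such grouping is below; the alias table is hypothetical and would in practice be curated from the model outputs.

```python
# Sketch of the term-grouping preprocessing; the alias table below is a
# hypothetical fragment of what a curated mapping could look like.
ALIASES = {
    "long short-term memory networks": "LSTMs",
    "long short-term memory": "LSTMs",
    "support vector machine": "SVMs",
    "support vector machines": "SVMs",
}

def canonicalize(term: str) -> str:
    """Map surface variants of a task/method name to one canonical form."""
    return ALIASES.get(term.lower().strip(), term)

print(canonicalize("Long Short-Term Memory Networks"))  # -> LSTMs
```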

Figure 2: Our pipeline to automate the NLP4SG report.

Figure 4: The homepage of our NLP4SG website.

Figure 5: Percentage of social good-related papers (top) and the total number of papers (bottom) in the ACL Anthology each year from 1980.

Figure 6: SDG importance vs. actual efforts. We use darker red for shorter bars, so the overlooked goals (especially in the right chart) can draw more attention.

Figure 7: An example paper in the paper list after clicking on "Health" and "Named Entity Recognition."

Figure 8: Visualization of the importance of/efforts on the SDGs by sustainability researchers, NLP researchers, and actual ACL Anthology papers.

Figure 9: Word cloud of the organizations mentioned in the NLP4SG papers.

Figure 10: Sankey diagram of the organizations supporting the NLP4SG research work on each SDG.

Figure 12: A decision flowchart used for the annotation of Task 1, NLP4SG binary classification.

[Axis labels: G1 Poverty, G2 Hunger, G3 Health, G4 Education, G5 Gender, G6 Water, G7 Energy, G8 Economy, G9 Innovation, G10 Inequalities, G11 Sustainable Cities, G12 Consumption, G13 Climate, G14 Life Below Water, G15 Life on Land, G16 Peace, G17 Partnership]

E Details of the Interpretability Analysis

E.1 Task 1: Example LIME Outputs

Using the best model, we show examples of LIME output for true positive (TP), true negative (TN), false positive (FP), and false negative (FN) data samples in Figures 14 to 17, respectively.

Figure 14: LIME output for a TP example.

Figure 15: LIME output for a TN example.

Figure 16: LIME output for an FP example.

Figure 17: LIME output for an FN example.

Figure 18: Backgrounds of the survey subjects, including the number of years in NLP research, career stage, and research areas.

Table 4: Model performance on Task 2. We report the weighted F1, the percentage of samples whose ground-truth list of goals exactly matches (EM) the model output, the percentage of samples whose ground-truth list of goals partially matches (PM) the model output, precision (P), and recall (R).

Table 5: The class-specific performance of our best-performing InstructGPT method.

Table 7: Number of upsampled papers by SDG. We first identify the candidate papers (# Matched Papers), and then manually annotate them (# Papers after Annotation). In total, there are 261 candidate papers, of which 167 are identified as relevant for NLP4SG.

Table 8: Example annotation of Task 3.

Table 9: Cohen's kappa coefficients for each SDG.

Materials prior to 2016 in the ACL Anthology are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License, and materials published in or after 2016 are licensed under a Creative Commons Attribution 4.0 International License.

Table 10: Quality evaluation of the distant supervision heuristics, reported on the training set.

Table 11: For the best-performing model, SciBERT+DS, we conduct ablation studies by removing the different subsets of our DS data. The resulting performance correlates with the DS data quality reported in Table 10.
