How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact

Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at http://github.com/zhijing-jin/nlp4sg_acl2021. In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.


Introduction
Advances on multiple NLP fronts have given rise to a plethora of applications that are now integrated into our daily lives. NLP-based intelligent agents like Amazon Echo and Google Home have entered millions of households (Voicebot, 2020). NLP tools are now prevalent on phones, in cars, and in many daily services such as Google search and electronic health record analysis (Townsend, 2013).
In the current COVID-19 context, NLP has already had an important positive social impact in the face of a public health crisis. When the pandemic broke out, Allen AI collected the CORD-19 dataset (Wang et al., 2020) with the goal of helping public health experts efficiently sift through the myriad of COVID-19 research papers that emerged in a short time period. Subsequently, NLP services such as Amazon Kendra were deployed to help organize the research knowledge around COVID-19 (Bhatia et al., 2020). The NLP research community worked on several problems, such as the question-answering and summarization system CAiRE-COVID, the expressive interviewing conversational system (Welch et al., 2020), and annotation schemas to help fight COVID-19 misinformation online (Alam et al., 2020; Hossain et al., 2020).
As NLP transitions from theory into practice and into daily lives, unintended negative consequences that early theoretical researchers did not anticipate have also emerged, from the toxic language of Microsoft's Twitter bot Tay (Shah and Chokkattu, 2016) to the privacy leaks of Amazon Alexa (Chung et al., 2017). A currently much-debated topic in NLP ethics is GPT-3 (Brown et al., 2020), whose risks and harms include encoding gender and racial biases (Bender et al., 2021).
It is now evident that we must consider the negative and positive impacts of NLP as two sides of the same coin, a consequence of how NLP and more generally AI pervade our daily lives. The consideration of the negative impacts of AI has engendered the recent and popular interdisciplinary field of AI ethics, which puts forth issues such as algorithmic bias, fairness, transparency and equity with an aim to provide recommendations for ethical development of algorithms.
Highly influential works in AI ethics include Buolamwini and Gebru (2018); Mitchell et al. (2019); Raji et al. (2020); Chen et al. (2019); Blodgett et al. (2020). AI for social good (AI4SG) (Tomašev et al., 2020) is a related sub-field that benefits from the results of AI ethics and, while keeping ethical principles as a prerequisite, aims to create positive impact and address society's biggest challenges. Work in this space includes Wang et al. (2020); Bhatia et al. (2020); Killian et al. (2019); Lampos et al. (2020).
Active conversations about ethics and social good have expanded broadly, in the NLP community as well as the broader AI and ML communities. Starting with early discussions in works such as (Hovy and Spruit, 2016;Leidner and Plachouras, 2017), the communities introduced the first workshop on ethics in NLP (Hovy et al., 2017) and the AI for social good workshop (Luck et al., 2018), which inspired various follow-up workshops at venues like ICML and ICLR. The NLP for Positive Impact Workshops (Field et al., 2020;Biester et al., 2022) find inspiration from these early papers and workshops. In 2020, NeurIPS required all research papers to submit broader impact statements (Castelvecchi, 2020;Gibney, 2020). NLP conferences followed suit and introduced optional ethical and impact statements, starting with ACL in 2021 (Association for Computational Linguistics, 2021).
With the growing impact of our models in daily lives, we need comprehensive guidelines for following ethical standards to result in positive impact and prevent unnecessary societal harm. Tomašev et al. (2020) provide general guidelines for successful AI4SG collaborations through the lens of United Nations (UN) sustainable development goals (SDGs) (United Nations, 2015) and Hovy and Spruit (2016); Leidner and Plachouras (2017) begin the ethics discussions in NLP. However, there is room for iteration in terms of presenting a comprehensive picture of NLP for social good, with an evaluation framework and guidelines. At the moment, researchers eager to make a beneficial contribution need to base their research agenda on intuition and word of mouth recommendations, rather than a scientific evaluation framework.
To this end, our paper presents a modest effort toward the understanding of social good, and sketches thinking guidelines and heuristics for NLP for social good. Our main goal is to answer the question: Given a specific researcher or team with skills s, and the set of NLP technologies T they can work on, what is the best technology t ∈ T for them to optimize the social good impact I?
In order to answer this overall question, we take a multidisciplinary approach in our paper:
• §2 relies on theories in moral philosophy to approach what is social good versus bad (i.e., the sign and rough magnitude of impact I for a direct act a);
• §3 relies on causal structure models as a framework to estimate I for t ∈ T, considering that t can be an indirect cause of impact;
• §4 relies on concepts from global priorities research and economics to introduce a high-level framework to choose a technology t that optimizes the social impact I;
• §5 applies the above tools to analyze several example NLP directions, and provides a practical guide on how to reflect on the social impact of NLP.
We acknowledge the iterative nature of NLP for social good as a newly emerging field, which requires continuing discussions on definitions and the development of ethical frameworks and guidelines. Echoing the history of scientific development (Kuhn, 2012), the goal of our work is not to provide a perfect, quantitative, and deterministic answer about how to maximize social good with our NLP applications. The scope of our work is to take one step closer to a comprehensive understanding, through high-level philosophies and thinking frameworks, together with heuristics and examples.
2 What is social good?
Defining social good can be controversial. For example, if we define saving energy as social good, then what about people who get sick because of not turning on the air-conditioner on a cold day? Therefore, social good is context-dependent, relative to people, times, and states of nature (Broome, 2017). This section provides a theoretical framework for the social impact I of a direct act a.

Moral philosophy theories
We can observe that for some acts, it is relatively certain to judge whether the impact is positive or negative. For example, solving global hunger is in general a positive act. Such judgement is called intuitionalism (Sidgwick, 1874), a school of moral philosophy.
There are many areas of social impact on which intuitions do not reach consensus. To find analytical solutions to these debatable topics, several moral philosophies have been proposed. We introduce below three categories of philosophical perspectives to judge moral laws (Kagan, 2018), and provide the percentage of professional philosophers who support each theory (Bourget and Chalmers, 2014):
1. Deontology: emphasizes duties or rules, endorsed by 25.9% of philosophers;
2. Consequentialism: emphasizes consequences of acts, endorsed by 23.6% of philosophers;
3. Virtue ethics: emphasizes virtues and moral character, endorsed by 18.2% of philosophers.
Note that the above three schools, deontology, consequentialism, and virtue ethics, follow the standard textbook introductions to normative ethics in the analytic philosophy tradition. It is also possible for future research to consider different perspectives while defining social good.
A practical guide for using these philosophies. The three perspectives provide us dimensions to think about the impact I of an act a, so that the final decision is (hopefully) more reliable than one single thought which is subject to biases. Such decomposition practices are often used in highly complicated analyses (e.g., business decisions), such as radar charts to rate a decision/candidate or SMART goals.
A practical guide for using moral philosophies to judge an act a is to think along each of the three perspectives, collect estimations of how good the act a is from the three dimensions, and merge them. For example, using NLP for healthcare to save lives can be good from all three perspectives, and thus it is an overall social good act.
When merging judgements from the above philosophical views, there can be tradeoffs, such as sacrificing one life for five lives in the Trolley problem (Thomson, 1976), which scores high on consequentialism but low on deontology and virtue ethics. One solution by the moral uncertainty theory (MacAskill et al., 2020) is to favor acts with more balanced judgements on all criteria, and reject acts that are completely unacceptable on any criterion.
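The merging rule described above can be sketched in code. This is only an illustration under stated assumptions: the score range of [-1, 1], the `UNACCEPTABLE` threshold, the `balance_weight` penalty, and all example scores are hypothetical choices made here, not values given in the paper or in MacAskill et al. (2020).

```python
from typing import Dict, Optional

# Assumption: each perspective rates an act on a scale from -1 to 1,
# and -1 means "completely unacceptable" on that criterion.
UNACCEPTABLE = -1.0


def merge_judgements(scores: Dict[str, float],
                     balance_weight: float = 0.5) -> Optional[float]:
    """Merge per-perspective scores; return None if the act is rejected."""
    values = list(scores.values())
    # Reject acts that are completely unacceptable on any criterion.
    if any(v <= UNACCEPTABLE for v in values):
        return None
    mean = sum(values) / len(values)
    # Favor balanced judgements by penalizing the best-to-worst spread.
    spread = max(values) - min(values)
    return mean - balance_weight * spread


# Hypothetical scores, for illustration only.
healthcare_nlp = {"deontology": 0.8, "consequentialism": 0.9, "virtue_ethics": 0.8}
trolley_switch = {"deontology": -1.0, "consequentialism": 0.7, "virtue_ethics": -0.5}

healthcare_score = merge_judgements(healthcare_nlp)
trolley_score = merge_judgements(trolley_switch)  # rejected, so None
```

Subtracting a spread penalty from the mean is one of many possible ways to encode a preference for balanced judgements; a minimum over perspectives or a weighted vote would be equally defensible readings.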

Principles for future AI
Many agencies from academia, government, and industries have proposed principles for future AI (Jobin et al., 2019), which can be regarded as a practical guide by deontology. Zeng et al. (2019) surveyed the principles of the governance of AI proposed by 27 agencies. The main areas are as follows (with keywords):
• Humanity: beneficial, well-being, human rights, dignity, freedom, education, human-friendly.
• Privacy: personal information, data protection, explicit confirmation, control of the data, notice and consent.

Evaluating the indirect impact of NLP
Given the general moral guide to judge an act with direct impacts, we now step towards the second stage: understanding the downstream impact of scientific research, which typically has indirect impacts. For example, it is not easily tractable to estimate the impact of some linguistic theories. To sketch a solution, this section will first classify NLP tasks along the dimension of theory→application, and then provide an evaluation framework for the impact I of a technology t that may have indirect real-life impacts.

Classifying tasks from upstream to downstream
To evaluate each NLP research topic, we propose four stages in the theory→application development, as shown in Figure 1, and categorize the 570 long papers from ACL 2020 according to the four stages in Figure 2. Details of the annotation are in Appendix A. The four stages are as follows.
Stage 1. Fundamental theories. Fundamental theories are the foundations of knowledge, such as linguistic theories by Noam Chomsky. In ACL 2020, the most prevalent topic for papers in Stage 1 is linguistics theory in Figure 2. Importantly, Stage 1's main goal is the advancement of knowledge, and to widen the potentials for later-stage research.
Stage 2. Building block tools. Moving one step from theory towards applications is the research on building block tools, which serve as important building blocks and toolboxes for downstream technologies. The most frequently researched Stage-2 topics at ACL 2020 are information extraction, model design, and interpretability (in Figure 2).
Stage 3. Applicable tools. Applicable tools are pre-commercialized NLP systems which can serve as the backbones of real-world applications. This category includes NLP tasks such as dialog response generation, question answering, and machine translation. The most common research topics in this category are dialog, machine translation, and question answering (in Figure 2).
Stage 4. Deployed applications/products. Deployed applications often build upon tools in Stage 3, and wrap them with user interfaces, customer services, and business models. Typical examples of Stage-4 technologies include Amazon Echo, Google Translate, and so on. The top three topics of ACL 2020 papers in this category are ways to address misinformation (e.g., a fact checker for news bias), dialog, and NLP for healthcare.

Estimating impact
Direct impacts of Stage-4 technologies. A direct impact of NLP development is allowing users more free time. This is evident in automatic machine translation, which saves the effort and time of human translators, or in NLP for healthcare, which allows doctors to more quickly sift through patient history. Automatic fake news detection frees up time for human fact-checkers and helps them detect fake news more quickly among the increasing number of digital news articles being published. The impact of more user free time is varied. In the case of healthcare, NLP can free up time for more personalized patient care, or allow free time for activities of choice, such as spending time on passion projects or more time with family. We recognize these varied impacts of NLP deployment, and recommend user productivity as one way to measure them.
Note that there can be positive as well as negative impact associated with rising productivity, and the polarity can be decided according to Section 2.1. Typical positive impacts of NLP technology include better healthcare and well-being, and in some cases it indirectly helps with avoiding existential risks, sustainability, and so on. Typical negative impacts include more prevalent surveillance, propaganda, breach of privacy, and so on. For example, intelligent bots can improve efficiency at work (to benefit economics), and bring generally better well-being for households, but they might leak user privacy (Chung et al., 2017).
Thus, estimating the overall end impact of a technology t in Stage 4 requires accumulating over a set of aspects AS:
$$I(t) = \sum_{as \in AS} \mathrm{scale}_{as}(t) \cdot \mathrm{impact}_{as}(t), \tag{1}$$
where $\mathrm{scale}_{as}(t)$ is the usage scale of applications of technology t used in the aspect as, and $\mathrm{impact}_{as}(t)$ is the impact of t in this aspect.
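As a minimal sketch of this accumulation, the snippet below sums usage scale times per-aspect impact over a set of aspects. The `Aspect` class, the units, and every number are hypothetical placeholders; the paper itself stresses that these quantities are not directly calculable today.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Aspect:
    name: str
    scale: float   # usage scale of the technology in this aspect (made-up units)
    impact: float  # signed impact per unit of usage; negative means harm


def stage4_impact(aspects: List[Aspect]) -> float:
    """Sum of scale * impact over all aspects, following Eq. (1)."""
    return sum(a.scale * a.impact for a in aspects)


# Hypothetical aspects of an intelligent household bot, echoing the
# efficiency / well-being / privacy example in the text.
household_bot = [
    Aspect("work efficiency", scale=100.0, impact=0.2),
    Aspect("household well-being", scale=50.0, impact=0.1),
    Aspect("privacy leakage", scale=100.0, impact=-0.15),
]

total_impact = stage4_impact(household_bot)
```

Keeping the impact signed lets positive and negative aspects offset each other in one total, which matches the observation that the same product can improve well-being while leaking privacy.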
Indirect impacts of early stage technologies. Although the direct impact of Stage-4 technologies can be estimated by Eq. (1), it is difficult to calculate the impact of a technology in earlier stages (i.e., Stage 1-3).
We can approach the calculation of indirect impacts I of an early-stage technology t by a structural causal model. As shown in the causal graph G in Figure 3, each technology t is in a causal chain from its parent vertex set PA(t) (i.e., upstream technologies that directly cause the invention of t) to its children vertex set CH(t) (i.e., downstream technologies directly resulting from t). Formally, we denote a directed (causal) path in G as a sequence of distinct vertices $(t_1, t_2, \ldots, t_n)$ such that $t_{i+1} \in CH(t_i)$ for all $i = 1, \ldots, n-1$. We call $t_n$ a descendant of $t_1$. After enumerating all paths, we denote the set of all descendants of t as DE(t). Specifically, we denote all descendant nodes in Stage 4 as Stage-4 DE(t).
Hence, the impact of any technology t is the sum of the impacts of all its descendants in Stage 4:
$$I(t) = \sum_{x \in \text{Stage-4 } DE(t)} p(x) \cdot c_x(t) \cdot I(x), \tag{2}$$
where p(x) is the probability that the descendant technology x can be successfully developed, $c_x(t)$ is the contribution of t to x, and I(x) can be calculated by Eq. (1). This formula can also be interpreted in the light of do-calculus (Pearl, 1995) as P(X | do(t)) − P(X), for X ∈ Stage-4 DE(t), which means the effect of the intervention do(t) on Stage-4 descendants.
Note that Eq. (1) and (2) are meta frameworks, and we leave it to future work to utilize these for assessing the social impact of their work.
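To make the meta framework concrete, here is a small sketch of the indirect-impact sum: it collects the Stage-4 descendants of a technology in a causal graph and adds up success probability times contribution times direct impact. The adjacency-dict representation, the function names, and the toy graph with all its numbers are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def descendants(children: Dict[str, List[str]], t: str) -> Set[str]:
    """Return DE(t): every node reachable from t along directed edges."""
    seen: Set[str] = set()
    stack = list(children.get(t, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def indirect_impact(
    t: str,
    children: Dict[str, List[str]],
    stage: Dict[str, int],
    p: Dict[str, float],
    contribution: Dict[Tuple[str, str], float],
    direct_impact: Dict[str, float],
) -> float:
    """Eq. (2): sum of p(x) * c_x(t) * I(x) over Stage-4 descendants x of t."""
    return sum(
        p[x] * contribution[(t, x)] * direct_impact[x]
        for x in descendants(children, t)
        if stage[x] == 4
    )


# Hypothetical toy graph; every name and number below is made up.
children = {
    "parser": ["mt_system", "qa_system"],
    "mt_system": ["translation_app"],
    "qa_system": ["voice_assistant"],
}
stage = {"parser": 2, "mt_system": 3, "qa_system": 3,
         "translation_app": 4, "voice_assistant": 4}
p = {"translation_app": 0.8, "voice_assistant": 0.5}
contribution = {("parser", "translation_app"): 0.1,
                ("parser", "voice_assistant"): 0.2}
direct_impact = {"translation_app": 100.0, "voice_assistant": 40.0}

parser_impact = indirect_impact("parser", children, stage, p,
                                contribution, direct_impact)
```

In practice the hard part is estimating `p`, `contribution`, and `direct_impact`, not the traversal; the code only shows how the pieces combine once rough estimates exist.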

Takeaways for NLP tasks
With the growing interest of AI and NLP publication venues (e.g., NeurIPS, ACL) in ethical and broader impact statements, it will be useful and important for researchers to have practical guidelines on evaluating the impact of their NLP tasks.
We first introduce some thinking steps to estimate the impact of research on an NLP task t:
(S1) Classify the NLP task t into one of the four stages (§3.1).
(S2) If t is in Stage 4, think of the set of aspects AS that t will impact, the scale of applications, and the aspect-specific impact magnitude. Finally, estimate the impact using Eq. (1).
(S2') If t is in Stage 1-3, think of its descendant technologies, their success rate, and the contribution of t to them. Finally, estimate the impact using Eq. (1) and (2).
Next, we introduce some high-level heuristics to facilitate fast decisions: (H1) For earlier stages (i.e., Stage 1-2), it is challenging to quantify the exact social impact. Their overall impact tends to lean towards positive as they create more knowledge that benefits future technology development.

Deciding research priority
There are many directions for expansion of our efforts for social good; however, due to limited resources and availability of support for each researcher, we provide a research priority list. In this section, we are effectively trying to answer the overall question proposed in Section 1. Specifically, we adopt the practice in the research field of global priorities (GP) (MacAskill, 2015;Greaves and McAskill, 2017). We first introduce the high-level decision-making framework in Section 4.1, and then formulate these principles using technical terms in Section 4.2.

Important/Neglected/Tractable (INT) framework
Our thinking framework to address the research priority follows the practice of existing cost-benefit analysis in GP (MacAskill, 2015;Greaves and McAskill, 2017), which aligns with the norms in established fields such as development economics, welfare economics, and public policy. We draw an analogy between the existing GP research and NLP for social good. Basically, GP addresses the following problem: given, for example, 500 billion US dollars (which is the annual worldwide expenditure on social good), what priority areas should we spend on? Inspired by this practical setting, we form an analogy to NLP research efforts, namely to answer the question proposed in Section 1 about how to attribute resources and efforts on NLP research for social good.
The high-level intuitions are drawn from the Important/Neglected/Tractable (INT) framework (MacAskill, 2015), a commonly adopted framework in global priorities research on social good. Assume each agent has something to contribute (e.g., money, effort, etc.). It is generally effective to contribute to important, neglected, and tractable areas.

Calculation of priority
Although the INT framework is commonly used in the practice of many philanthropy organizations (MacAskill, 2015), it will be more helpful to formulate it using mathematical terms and economic concepts. Note that the terms we formulate in this section can be regarded as elements in our proposed thinking framework, but they are not directly calculable (see footnote 3). Our end goal is to estimate the cost-effectiveness of contributing a unit of time and effort of a certain researcher or team to research on the technology t. So far we have a meta framework to estimate the impacts I brought by successful development of a technology t. We introduce the notation in Table 1.
Footnote 3: We adapted these terms from GP. Such terms to estimate priority have been successfully used by real-world social good organizations, e.g., GiveWell, Global Priorities Institute, the Open Philanthropy Project (a foundation with over 10 billion USD investment), ReThink Priorities, 80,000 Hours Organization. In the long run, the NLP community may potentially benefit from aligning with GP's terminology. Still, we do not recommend applying our framework in high-stakes settings yet, since it currently serves only as a starting point.

Notation | Meaning
r | An NLP researcher or research group
T(r) | The set of NLP topics that the researcher can pursue (limited by skills, resources, and passion)
t | An NLP technology
I(t) | Social impacts brought by successful development of t
prog(t) | The current progress of t
p(t; r) | Probability that research in t succeeds based on the skills of the researcher r
p(t; r)I(t) | Expected social impact of the researcher r's work on t
∆t(r) | Improvement of t per unit resource (incl. time, effort, money, etc.) of the researcher r

For a researcher r, the action set per unit resource is {∆t | t ∈ T(r)}. Equivalently speaking, they can intervene at a node t by the amount of ∆t(r) in the causal graph G in Figure 3.
The first useful concept is p(t; r)I(t), the expected social impact of research on a technology t. Here the success rate p(t; r) is crucial because much research does not produce the expected outcome. However, if the impact of a technology can be extremely large (for example, prevention of extinction has impact near positive infinity), then even with a very low success rate, we should still devote considerable effort to it.
The second concept that is worth attention is the marginal impact (Pindyck et al., 1995) of one more unit of resources of the researcher r into the technology t, calculated as
$$\Delta I(t; r) := I(\mathrm{prog}(t) + \Delta t(r)) - I(\mathrm{prog}(t)). \tag{3}$$
For example, if the field associated with the technology is almost saturated, or if many other researchers working on this field are highly competent, then, for a certain research group, blindly devoting time to the field may have little marginal impact. On the other hand, if a field is important but neglected, the marginal impact of pushing it forward can be large. This also explains why researchers are passionate about creating a new research field.
The third useful concept is the opportunity cost (Palmer and Raftery, 1999) of devoting researcher r's resources to the technology t instead of a possibly more optimal technology $t^*$. Formally, the opportunity cost is calculated as
$$t^*(r) := \arg\max_{x} \Delta I(x; r), \tag{4}$$
$$\mathrm{Cost}(t; r) := \Delta I(t^*(r); r) - \Delta I(t; r), \tag{5}$$
where $t^*$ is the optimal technology that can bring the largest expected improvement of social impact. The opportunity cost conveys the important message that we should not just do good, but do the best, because the difference from good to best can be a large loss.
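The marginal impact and opportunity cost can be illustrated with a short sketch. It assumes, purely for illustration, that impact is a concave function of progress (`toy_impact`), that both fields share this function, and that the researcher adds the same amount of progress to either field; none of these assumptions come from the paper.

```python
from typing import Callable, Dict


def marginal_impact(impact_of_progress: Callable[[float], float],
                    progress: float, delta: float) -> float:
    """Eq. (3): I(prog(t) + delta_t(r)) - I(prog(t))."""
    return impact_of_progress(progress + delta) - impact_of_progress(progress)


def opportunity_costs(marginals: Dict[str, float]) -> Dict[str, float]:
    """Eq. (4)-(5): gap between the best marginal impact and each option."""
    best = max(marginals.values())
    return {t: best - m for t, m in marginals.items()}


def toy_impact(progress: float) -> float:
    # Assumed concave curve: early progress matters more than late progress.
    return 100.0 * progress ** 0.5


# Hypothetical fields: one nearly saturated, one neglected.
marginals = {
    "saturated_field": marginal_impact(toy_impact, progress=0.9, delta=0.05),
    "neglected_field": marginal_impact(toy_impact, progress=0.1, delta=0.05),
}
costs = opportunity_costs(marginals)
```

Under the concavity assumption the neglected field yields the larger marginal impact, so its opportunity cost is zero and the saturated field carries a positive cost, mirroring the qualitative argument in the text.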
Estimating the variables. Note that the frameworks we have proposed so far are at the meta level, useful for guiding thought experiments, and future research. Exact calculations are not possible with the current state of research in NLP for social good, although achievable in the future. A practical insight is that NLP researchers estimate the impact of their research via qualitative explanations (natural language) or rough quantitative ones. For example, the introduction section of most NLP papers or funding proposals is a natural language-based estimation of the impact of the research. Such estimations can be useful to some extent (Hubbard and Drummond, 2011), although precise indicators of impact can motivate the work more strongly.
We can also borrow some criteria from effective altruism, a global movement that establishes a philosophical framework for social good as well as statistical calculations of it. One of the established metrics for calculating impact is the "quality-adjusted life year" (QALY), as discussed by MacAskill (2015). QALYs count the number of life years (calibrated by life quality such as health conditions) that an act helps to increase.
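A rough arithmetic sketch of a QALY count follows. It assumes the common convention that each period of life is weighted by a quality value between 0 and 1, with 1 standing for full health; the function names and the example numbers are hypothetical.

```python
from typing import List, Tuple

# A period is (years lived, quality weight in [0, 1]).
Period = Tuple[float, float]


def qalys(periods: List[Period]) -> float:
    """Total quality-adjusted life years over a list of periods."""
    total = 0.0
    for years, quality in periods:
        if not 0.0 <= quality <= 1.0:
            raise ValueError("quality weight must lie in [0, 1]")
        total += years * quality
    return total


def qalys_gained(with_act: List[Period], without_act: List[Period]) -> float:
    """QALYs attributable to an act: outcome with it minus outcome without it."""
    return qalys(with_act) - qalys(without_act)


# Hypothetical patient: an intervention raises quality of life for 10 years
# and adds 2 further years at lower quality.
without_act = [(10.0, 0.5)]
with_act = [(10.0, 0.75), (2.0, 0.5)]

gain = qalys_gained(with_act, without_act)
```

Comparing against the counterfactual (what would have happened without the act) is what makes the figure usable for the marginal-impact reasoning above.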

Evaluating NLP tasks
In this section, we will first try to categorize the current state of NLP research for social good based on ACL 2020 papers, and then highlight NLP topics that are aligned with the UN's SDGs. We will conclude with a practical checklist and case studies of common NLP tasks using this checklist.

Current state of NLP research for social good: ACL 2020 as a case study
We want to compare the ideal priority list with the current distribution of NLP papers for social good. As a case study of the current research frontier, we plot the topic distribution of the 89 ACL 2020 papers that are related to NLP for social good in Figure 4. We also show the portion of papers by the 10 countries with the most social-good papers.
Our annotation details are in Appendix A. As illustrated in Figure 4, most social-good papers work on interpretability, tackling misinformation (e.g., fact-checking for news), and healthcare (e.g., to increase the capacity of doctors). In terms of countries, the US has the most papers on interpretability, but no papers on NLP for education, NLP for legal applications, and some other topics. China has few papers on interpretability, although interpretability is the largest topic. India has no papers on fighting misinformation, although it is the second largest topic. Only 5 countries have publications across more than two social good topics. Please refer to Appendix B for more analyses, such as social-good papers by academia vs. industry.
However, compared with the UN's SDGs (United Nations, 2015), the current NLP research (at least in the scope of ACL conference submissions) lacks attention to other important cause areas such as tackling global hunger, extreme poverty, clean water and sanitation, and clean energy. There are also too few research papers on NLP for education, although education is the 4th most important area in SDGs.
One cause of this difference is value misalignment. Most NLP research is supported by stakeholders and funding agencies, which have a large impact on the current research trends or preferences in the NLP community. The perspective from social good with a framework to calculate the priority list has still not reached many in the NLP community.
Although we do not have data on expenditure in each NLP subarea, we can get a glimpse of the value misalignment in general. Table 2 shows the annual spending of some cause areas. Note that the ranking of the expenditure does not align with our priority list for social good. For example, luxury goods are not as important as global poverty, but luxury goods cost 1.3 trillion USD each year, almost five times the expenditure on global poverty.

Aligning NLP with social good
In this subsection, we list the top priorities according to UN's SDGs (United Nations, 2015). For each goal, in Table 3 we include examples of existing NLP research, and suggest potential NLP tasks that can be developed (labeled as (proposed)).

Checklist
As a practical guide, we compile the takeaways of this paper into a list of heuristics that might be helpful for future practitioners of NLP for social good. To inspect the social goodness of an NLP research direction (especially in Stage 3-4), the potential list of questions to answer is as follows:
(Q1) What kind of people/process will benefit from or be harmed by the technology?
(Q2) Does it reinforce the traditional structure of beneficiaries? I.e., what groups of underprivileged people can be benefited? (e.g., by gender, demographics, socio-economic status, country, native languages, disability type)
(Q3) Does it contribute to SDG priority goals such as poverty, hunger, health, education, equality, clean water, and clean energy?
(Q4) Can it directly improve quality of life? E.g., how many QALYs might it result in?
(Q5) Does it count as (a) mitigating problems brought by NLP, or (b) proactively helping out-of-NLP social problems?

Case studies by the checklist
We conduct some case studies of NLP technologies using the checklist.

Low-resource NLP & machine translation.
This category includes NLP on low-resource languages, such as NLP for Filipino (Sagum et al., 2019;Cruz et al., 2020), MT for Haitian Creole after the 2010 Haiti earthquake (Lewis, 2010), and machine translation in general. Because this direction expands the users of NLP technologies from English-speaking people to speakers of other languages, it benefits people speaking these languages (Q1), and helps to narrow the gap between English-speaking and non-English-speaking end users (Q2), although it is still likely that people who can afford intelligent devices will benefit more than those who cannot. This category can contribute directly to goals such as equality and education, and indirectly to other goals because translation of documents in general helps the sharing of information and knowledge (Q3). It directly improves quality of life, for example, for immigrants who may have difficulties with the local language (Q4). Thus, it counts as social good category (b) in (Q5).
Transparency, interpretability, algorithmic fairness and bias. Research in this direction can impact users who need more reliable decision-making NLP, such as in the selection process for loans, jobs, criminal judgements, and medical treatments (Q1). It can shorten the waiting time of candidates and still make fair decisions regardless of spurious correlations (Q2) (Q4). It reduces the inequality introduced by AI but, at least with the current technology, does not increase equality beyond that of human-made decisions (Q2). Thus, it is social good category (a) in (Q5).
Green NLP. Green NLP reduces the energy consumption of large-scale NLP models. Although it works towards the goal of affordable and clean energy (Q3) by neutralizing the negative impact of training NLP models, it does not impact out-of-NLP energy problems. Green NLP belongs to social good category (a) in (Q5). It does not have large impacts directly targeted at (Q1), (Q2) and (Q4).
QA & dialog. People who can afford devices embedded with intelligent agents can use them; this is about 48.46% of the global population (BankMyCell, 2021) (Q1). So this benefits people with higher socio-economic status, and benefits English-speaking people more than others, not to mention job replacements for labor-intensive service positions (Q2). It does not contribute to priority goals except for education and healthcare for people who can afford intelligent devices (Q3). Nonetheless, it can improve the quality of life of its user group (Q4). It can be regarded as social good of category (b) in (Q5).
Information extraction, NLP-powered search engine & summarization. This direction speeds up the information compilation process, which can increase the productivity in many areas. About 50% of the world population have access to the Internet and thus can use it (Meeker, 2019) (Q1) (Q2). This category indirectly helps education, and the information compilation process of other goals (Q3). It can largely improve the lives of its user group because people gather information very frequently (e.g., do at least one Google search every day) (Q4). Thus, it belongs to social good category (b) in (Q5).
NLP for social media. Research on social media provides tools for multiple parties. Social scientists can mine interesting trends and cultural phenomena; politicians can survey constituents' opinions and influence them; companies can investigate user interests and expand their markets (Q1). The caveat of dual use is large, and relies heavily on the stakeholders' intent: exploitation of the tools will lead to breaches of user privacy and information manipulation, whereas good use of the tools can help evidence-based policy makers (social good category (a) in (Q5)), and help to understand the driving principles of democratic behavior and combat the mechanisms that undermine it (social good category (b) in (Q5)). Such diverse possibilities of parties who use them leave (Q2) and (Q4) unanswerable. Also, this research direction has limited (and often indirect) contribution to priorities such as poverty and hunger, unless the related policies are under heated discussion online (Q3).

Conclusion
This paper presented a meta framework to evaluate NLP tasks in the light of social good, and proposed a practical guide for practitioners in NLP. We call for more attention towards awareness and categorization of social impact of NLP research, and we envision future NLP research taking on an important social role and contributing to multiple priority areas. We also acknowledge the iterative nature of this emerging field, requiring continuing discussions, improvements to our thinking framework and different ways to implement it in practice. We highlight that the goal of our work is to take one step closer to a comprehensive understanding of social good rather than introducing a deterministic answer about how to maximize social good with NLP applications.
ideas and methodology of the paper. We thank Osmond Wang for checking the economic concepts in the paper. We also thank Chris Brockett for checking many details in the paper. We thank the labmates in the LIT lab at University of Michigan, especially Laura Biester, Ian Stewart, Ashkan Kazemi, and Andrew Lee for constructive feedback. We also thank labmates at the MIT MEDG group, especially William Boag and Peter Szolovits for their constructive feedback. We also received much feedback on the first version of the paper; we thank Niklas Stoehr for constructive suggestions that helped make some arguments more comprehensive in the current version. We thank Jingwei Ni for the help with the annotation of the country and affiliation of the ACL 2020 papers.

Ethical and societal implications
Our paper establishes a framework to better understand the definition of social good in the context of NLP research, and lays out a recommended direction on how to achieve it. The contributions of our paper could benefit a focused, organized and accountable development of NLP for social good. The data used in our work is public and raises no privacy concerns.

A ACL 2020 paper annotations
For the case study on ACL 2020 papers, including Figures 2 and 4, we collect the 570 long papers at ACL 2020. An NLP researcher with four years of research experience conducted the entire annotation, so that the categorization is consistent across all papers (see footnote 4). The first annotation task is to categorize each paper into one of the four stages in the theory→application development. We showed the annotator the description of the four stages in Section 3.1. Next, provided with the title, abstract, and PDF of each paper, the annotator was asked to annotate which of the four stages the paper belongs to. The annotator passed a test batch before starting the large-scale annotation.
The second annotation task is to annotate the research topics of the papers related to social good at ACL 2020. If a paper has a clear social good impact (89 out of 570 papers), the annotator classifies its topic into one of the given categories: bias mitigation, education, equality, fighting misinformation, green NLP, healthcare, interpretability, legal applications, low-resource language, mental healthcare, robustness, science literature parsing, and others. Other meta information, such as country or academia vs. industry, is determined from the information of the first author.
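The two annotation tasks above amount to attaching a stage label and an optional social good topic to each paper, then counting. A minimal sketch of that tally follows; the record layout, field names, and example rows are hypothetical and are not taken from the released annotation file.

```python
from collections import Counter

# Hypothetical annotation records: field names and rows are illustrative
# assumptions, not the format of the actual annotation file.
records = [
    {"title": "Paper A", "stage": 2, "social_good_topic": None},
    {"title": "Paper B", "stage": 4, "social_good_topic": "healthcare"},
    {"title": "Paper C", "stage": 3, "social_good_topic": "bias mitigation"},
    {"title": "Paper D", "stage": 2, "social_good_topic": "healthcare"},
]


def tally(records):
    """Count papers per stage and per social good topic.

    Papers without a clear social good impact carry topic None and are
    excluded from the topic counts.
    """
    stage_counts = Counter(r["stage"] for r in records)
    topic_counts = Counter(
        r["social_good_topic"]
        for r in records
        if r["social_good_topic"] is not None
    )
    return stage_counts, topic_counts


stage_counts, topic_counts = tally(records)
print(dict(stage_counts))
print(dict(topic_counts))
```

Grouping the same records by the first author's country or affiliation type would follow the same pattern with a different key.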

B More statistics about ACL 2020 papers
For the case study on ACL 2020 papers, we further investigate the following statistics.
Stage 1-4 by countries. Recall that in Figure 2 of the main paper, we plot the distributions of papers by the four stages, and highlight the most frequent topics in each stage. It is also interesting to explore the distribution of stages for different countries. In Figure 5, we make the following observations: China does not have Stage-1 papers (i.e., fundamental theories), although it has the second largest total number of papers. The reason might be that few Chinese researchers working on linguistic theories publish at English-language conferences.
Most countries' number of papers in the four stages follows the overall trend (i.e., Stage-2 papers > Stage-3 papers > Stage-4 papers > Stage-1 papers), with a few exceptions. For example, China has almost the same number of papers in Stage 2 and 3, Germany has more papers in Stage 4 (i.e., deployed applications) than in Stage 3, and Canada has the most papers in Stage 3.
Social good topics by academia vs. industry. As we call for more research attention to NLP for social good, it is important to understand the affiliations behind the current social good papers. A coarse way is to look at the affiliation of the first author, and inspect whether the main work of the paper is done by people from academia or industry. As in Figure 6, overall academia publishes several times more papers on social good than the industry. This ratio is higher than the average ratio of papers from academia out of all ACL 2020 papers (389 from academia out of 570). Industry does not have ACL 2020 papers on topics such as NLP ethics. Note that using statistics from ACL papers alone could be limiting because researchers in academia typically present almost all research achievements through publications, but many industry researchers do not publish in public venues such as ACL, although their research may impact various products.
4 The annotation file has been uploaded to the softconf system.
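The baseline figures quoted in this appendix can be checked with simple arithmetic. The sketch below uses only the counts stated in the text (570 long papers, 389 with an academic first author, 89 with a clear social good impact). The academia-to-industry ratio additionally assumes that every paper without an academic first author counts as industry, which is our simplification and is not stated in the annotation description.

```python
TOTAL_PAPERS = 570       # long papers at ACL 2020 (from the text)
ACADEMIA_PAPERS = 389    # papers with an academic first author (from the text)
SOCIAL_GOOD_PAPERS = 89  # papers with a clear social good impact (from the text)

# Share of all papers whose first author is from academia.
academia_share = ACADEMIA_PAPERS / TOTAL_PAPERS

# Share of all papers annotated as having a clear social good impact.
social_good_share = SOCIAL_GOOD_PAPERS / TOTAL_PAPERS

# ASSUMPTION: all non-academia papers are treated as industry papers.
assumed_industry_papers = TOTAL_PAPERS - ACADEMIA_PAPERS
baseline_ratio = ACADEMIA_PAPERS / assumed_industry_papers

print(f"Academia share of all papers: {academia_share:.1%}")
print(f"Social good share of all papers: {social_good_share:.1%}")
print(f"Academia-to-industry ratio (under the assumption): {baseline_ratio:.2f}")
```

Under that assumption the overall academia-to-industry ratio is a little above two to one, which is the baseline against which the "several times more" ratio among social good papers in Figure 6 can be read.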