What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

GPT-3 shows the remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billions of tokens. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performance of different-sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, an 82B-parameter Korean variant of GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performance on various downstream tasks in Korean. We also show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. Then we discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts in ML through HyperCLOVA Studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.


Introduction
Due to its remarkable zero-shot and few-shot performance, GPT-3's in-context learning has gained significant attention in the AI community (Brown et al., 2020). In the in-context learning approach, discrete prompts consisting of a natural language task description and few-shot examples steer large-scale language models (LMs) to infer predictions for the target task. Using OpenAI's GPT-3, studies have proposed methods that further boost its in-context learning performance (Zhao et al., 2021; Liu et al., 2021a). Recently, prompt-based learning methods have been reported to improve the performance of BERT, GPT-3, and T5 without any parameter updates to the main model (Liu et al., 2021b; Lester et al., 2021; Shin et al., 2020).
We consider the following three practical issues in using GPT-3. First, the language composition of its training corpus is heavily skewed toward English, at 92.7%. This makes it difficult to apply the model to tasks in other languages. We also know little about how to train similar models in other languages with different linguistic properties, and about where the originally proposed methods apply naturally and where they might fail. Second, while it is pragmatic and useful to know the capabilities of various model sizes given the operational costs of large-scale LMs, we only have access to a thorough analysis of the 13B and 175B models (Brown et al., 2020) and none in between. Lastly, advanced prompt-based learning methods that require backward gradients of inputs, including continuous prompt-based tuning, have not yet been tested on an in-context large-scale LM learner.
Here we address these issues by introducing a non-English GPT-3 with various parameter sizes and intensively investigating its capabilities on diverse real-world classification and generation tasks under in-context few-shot learning and prompt-based optimization. We introduce a Korean in-context large-scale LM with 82B parameters, i.e., HyperCLOVA. This is the first report of a near 100B-scale non-English LM. We present the corpus composition of the Korean datasets used for HyperCLOVA, and describe how we crawl and refine such data to collect 561B tokens of Korean corpus ( §3.1). We also design a new Korean tokenization method for HyperCLOVA based on the language's agglutinative property: byte-level BPE (Kudo and Richardson, 2018) with a morpheme analyzer ( §3.3). Our results show that this tokenization strategy is important for the performance of downstream tasks in large-scale in-context learning ( §4.4).
We report the state-of-the-art in-context learning performance of our model on Korean datasets in zero-shot and few-shot settings ( §4.2). In addition, we are the first to demonstrate the applicability of continuous prompt-based optimization techniques, such as p-tuning (Liu et al., 2021b), to large-scale LMs. HyperCLOVA with p-tuning achieves outstanding results on both classification and generation tasks. We also investigate the effects of p-tuning on two mid-size HyperCLOVA models ( §4.3).
Subsequently, we illustrate the versatility of operating a single large-scale LM in the AI industry. Developing an AI product involves heavy collaboration among experts with different job functions, including non-technical ones. This incurs substantial communication overhead because the level of technical abstraction varies across job functions.
We introduce HyperCLOVA Studio, an interactive prompt engineering interface that provides GUI and API interfaces like the OpenAI Playground (https://beta.openai.com/). The interactive interface helps non-experts in ML easily use HyperCLOVA for prototyping AI products. We also share three in-house application scenarios using HyperCLOVA Studio as novel task environments. With minimal effort from a domain expert, HyperCLOVA achieves performance qualitatively comparable to that of human experts in these scenarios, even though it is difficult to design their objective functions and training data ( §5.2).
We then discuss how the functionality of HyperCLOVA Studio can be extended. For example, HyperCLOVA Studio can provide input gradient functionality to fine-tune a small prompt encoder with a small number of instances, enabling any user to achieve state-of-the-art performance with HyperCLOVA ( §5.3). Finally, we discuss the possibility of a No/Low Code AI paradigm using HyperCLOVA Studio, in which one large LM empowers people to create AI systems without training individual deep learning models or collecting and labeling suitable datasets ( §5.4).
Our contributions are summarized as follows: 1. We introduce HyperCLOVA, a large-scale Korean in-context learning-based LM with nearly 100B parameters, enabled by constructing a large Korean-centric corpus of 560B tokens.
2. We discover the effect of language-specific tokenization on large-scale in-context LMs trained on corpora of non-English languages.
3. We explore the zero-shot and few-shot capabilities of mid-size HyperCLOVA models with 39B and 82B parameters and find that prompt-based tuning can enhance performance, outperforming state-of-the-art models on downstream tasks when backward gradients of inputs are available.
4. We argue for the possibility of realizing No Code AI by designing HyperCLOVA Studio and applying it to our in-house applications. We will release HyperCLOVA Studio with input gradients, output filters, and knowledge injection.

Prompt Optimization
Prompt-based approaches construct optimal prompts for language models to best elicit knowledge and maximize prediction performance (Radford et al., 2019; Brown et al., 2020; Schick and Schütze, 2020). As the scale of language models grows, the potential of replacing the full fine-tuning paradigm with the prompt-based approach has been reported (Reynolds and McDonell, 2021; Li and Liang, 2021), since learning via prompts is efficient in time and space complexity. However, language models are highly sensitive to prompt design, motivating methodologies for optimizing prompts. Prompt optimization can be categorized into discrete and continuous approaches. The discrete approach optimizes directly in the token space (Ben-David et al., 2021; Shin et al., 2020) and has the advantage of transferability. However, Shin et al. (2020) showed that the discrete space has poor interpretability and can be suboptimal. These limitations spurred a new direction that aims to optimize prompts in the continuous space. Recent work (Li and Liang, 2021; Hambardzumyan et al., 2021; Liu et al., 2021b; Lester et al., 2021) proposed optimizing in contextualized token spaces without fine-tuning the main LM parameters. Notably, Liu et al. (2021b) found that p-tuning for autoregressive LMs outperforms MLM-based fine-tuning on certain downstream tasks. Lester et al. (2021) further showed that well-optimized prompt-based learning achieves state-of-the-art performance on key benchmarks.
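The contrast between the two approaches can be made concrete with a toy sketch: a frozen "model" (here just a fixed linear scoring layer) is never updated, while gradient descent runs only on a prepended continuous prompt vector. This is a minimal illustration of continuous prompt optimization under stated assumptions, not the actual p-tuning implementation; the dimensions, learning rate, and names below are all hypothetical.

```python
import math
import random

random.seed(0)

# Frozen "model": a fixed linear layer mapping a d-dim input to 3 class logits.
d, num_classes = 8, 3
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_classes)]

# Trainable continuous prompt vector: the ONLY parameters we update.
prompt = [0.0] * d
target = 1  # desired class for this toy task

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss_and_grad(p):
    logits = [sum(w_i * p_i for w_i, p_i in zip(row, p)) for row in W]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(p) = W^T (probs - onehot(target)); W itself stays frozen.
    err = [pr - (1.0 if k == target else 0.0) for k, pr in enumerate(probs)]
    grad = [sum(err[k] * W[k][j] for k in range(num_classes)) for j in range(d)]
    return loss, grad

losses = []
for step in range(100):
    loss, grad = loss_and_grad(prompt)
    losses.append(loss)
    prompt = [p - 0.1 * g for p, g in zip(prompt, grad)]

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only `prompt` changes across steps, which is the essence of the continuous approach: the main model parameters `W` are treated as read-only.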

Language Models
Although multilingual language models are publicly available (Devlin et al., 2019), language-specific language models are still in demand, as they provide an edge over language-agnostic models (Martin et al., 2020; Nguyen and Nguyen, 2020; Delobelle et al., 2020). However, due to their high cost, language-specific models for languages other than English remain limited in availability.
As such, the community has an untapped understanding of non-English in-context learners. To the best of our knowledge, multilingual in-context learners have not been explored yet, and research on in-context learners has focused on a few major languages. Recently, concurrent with our work, a GPT-like language model trained on Chinese corpora has been actively researched (Zeng et al., 2021). The authors successfully trained LMs with 2.6B and 13B parameters on a Chinese corpus, and they share their ongoing work on training a 207B model, the corresponding infrastructure, and the training techniques.

Data Description
The ratio of Korean data in the OpenAI GPT-3 training corpus is very small, at less than 0.02% by character count. Therefore, it is crucial to construct a large Korean-centric corpus before training HyperCLOVA.
The major corpus used for pre-training HyperCLOVA is listed in Table 1. Unlike the corpus of GPT-3, we gathered all available text data, including user-generated content (UGC) and content provided by external partners, with no legal violations, from both diverse NAVER services and external sources. We refined the datasets and collected a total of 561B tokens as the final corpus, which was randomly sampled for pre-training. Appendix A.1 gives the detailed data description and discussion. Appendix A.2, A.3, and A.4 thoroughly describe how to clean, anonymize, and preprocess the crawled raw data, respectively.

Model and Learning
We employ the same transformer decoder architecture as OpenAI's GPT-3 (Brown et al., 2020). Table 2 describes the detailed configurations of the different model sizes. We make our model design similar to GPT-3 and choose model sizes that nearly exponentially interpolate between the 13B and 175B OpenAI GPT-3 models. In particular, we aim to explore the capability and representation power of models with mid-size parameters, which have not yet been addressed by other studies on large-scale LMs (Brown et al., 2020) but are practically useful in many applications. These mid-size models contribute not only to understanding the properties of models with several tens of billions of parameters, but also to practical usage in real-world applications due to their more manageable sizes.
Our model is based on Megatron-LM (Shoeybi et al., 2019) and trained on the NVIDIA Superpod, which includes 128 strongly clustered DGX servers with 1,024 A100 GPUs. We use AdamW (Loshchilov and Hutter, 2019) with cosine learning rate scheduling and weight decay as the optimizer. All models use a mini-batch size of 1,024, and the minimum learning rate is 1/10 of the original learning rate. It takes 13.4 days to train a model with 82B parameters on 150B tokens. For the experiments in Section 4, the models trained on 150B tokens are used for fair comparison, because not all models had finished training at the same iteration. However, the experiments in Section 5.2 use models trained on 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained on 300B tokens.
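The schedule described above (cosine decay bottoming out at one tenth of the peak learning rate) can be sketched as follows; the peak value, total step count, and the absence of a warmup phase here are assumptions, since the text does not specify them.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_ratio=0.1):
    """Cosine decay from peak_lr down to min_ratio * peak_lr."""
    min_lr = min_ratio * peak_lr
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

peak = 1e-4   # hypothetical peak learning rate
total = 10000  # hypothetical total step count
print(cosine_lr(0, total, peak))      # starts at peak_lr
print(cosine_lr(total, total, peak))  # ends at 0.1 * peak_lr
```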
On test loss measured on an encyclopedia corpus held out from the HyperCLOVA corpus, we also observe the scaling law discovered in previous research (Brown et al., 2020; Kaplan et al., 2020). Figure 2 in Appendix B shows that increasing the model size and training longer both confer an advantage.

Korean Tokenization
Korean is an agglutinative language in which a noun is followed by particles, and the stem of a verb or adjective is followed by endings, expressing various grammatical properties. Properly tokenizing nouns and particles, and stems and endings, clarifies the semantics of each token. Park et al. (2020) report empirically that tokenization influences the performance of Korean LMs. Overall, we need to design a sophisticated tokenization strategy suitable for Korean LMs, different from their English counterparts.
We use morpheme-aware byte-level BPE as our tokenization method. GPT-2 and GPT-3 use byte-level BPE. However, unlike English letters, Korean characters like 'ㅎ', '하', or '한' are each split into three different Unicode bytes. We alleviate this problem of byte-level BPE by applying a morpheme analyzer. See Figure 5 in Appendix E for motivation and details.
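The byte-splitting issue is easy to verify: in UTF-8, each Korean character occupies three bytes, so plain byte-level BPE initially sees one syllable as three opaque byte tokens, whereas an ASCII letter is a single byte. A small illustrative check (not the HyperCLOVA tokenizer itself):

```python
# Each Korean character below encodes to three UTF-8 bytes.
for ch in "ㅎ하한":
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), list(encoded))

# An ASCII letter, by contrast, is a single byte.
print("a", len("a".encode("utf-8")))
```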
We pre-split sentences using spaces and morphemes obtained by an in-house morpheme analyzer, which excludes most non-Korean characters. Using the sentence parts pre-split by the morpheme analyzer, our morpheme-aware byte-level BPE learns sentences in which most non-Korean characters are expressed as single-byte characters. We use HuggingFace's tokenizers library.

Experimental Results

Experimental Setting
We mainly use five datasets for evaluating in-context few-shot learning performance. Two of the five datasets come from KLUE (Park et al., 2021), a massive benchmark of Korean NLU tasks and a work concurrent with our paper. We also use one additional in-house dataset for evaluating prompt-based optimization performance.

NSMC is a movie review dataset from NAVER Movies. The task is binary sentiment classification, like SST-2 (Socher et al., 2013). It contains 150K training examples and 50K test examples. For few-shot experiments, we generate 12 sets, each consisting of 70 examples randomly sampled from the training set. We average the test accuracies of 12 in-context 70-shot learning models.

KorQuAD 1.0 (Lim et al., 2019) is a Korean machine reading comprehension dataset. We follow the evaluation scheme of Brown et al. (2020), which uses a test paragraph, its corresponding four question-answer pairs, and the test question as the input to GPT-3. In other words, our model is a zero-shot learner from the perspective of the passage, but a four-shot learner from the perspective of the question. We performed a single trial for each model size.

AI Hub Korean-English corpus consists of Korean-English parallel sentences from news, government websites, legal documents, etc. The corpus consists of 800K sentence pairs, and we randomly sample 1K pairs for evaluating the Ko→En and En→Ko translation tasks. We performed three random trials for each translation task. Our model is evaluated in four-shot learning, and we use four different examples for each trial. We use the BLEU score for evaluation, where Moses and MeCab are used for comparison with the result of Park et al. (2020).

YNAT (Yonhap News Agency Topic Classification, or KLUE-TC), one of the KLUE benchmark tasks, is a topic classification problem with seven classes (Park et al., 2021). It consists of 45K, 9K, and 9K annotated headlines for the training, validation, and test sets, respectively. We average the test accuracies of 3 in-context 70-shot learners.
KLUE-STS, another KLUE benchmark task, is to predict the similarity between each pair of sentences, where the similarity score has a value between 0 and 5 (Park et al., 2021). We use the F1 score after binarizing the real-valued similarity, as suggested in the KLUE paper. We average the test accuracies of 3 in-context 40-shot learners.

Query modification is a task for AI speaker users. It targets the case where a single-turn FAQ system is already operating in AI speakers. Given a query that requires understanding of multi-turn context, the goal is to convert the multi-turn query into a single-turn query, which can then be understood by the single-turn AI speaker. There are 1,326 test instances in total. See Appendix C.3 for details.

Table 3 presents the results of few-shot learning on six tasks. In particular, we explore the performance of HyperCLOVA with mid-size parameters, including 39B and 82B, which is not addressed in the OpenAI GPT-3 paper (Brown et al., 2020) but can be more practical for real-world applications. Appendix C.1 and C.2 further report the standard deviation and maximum performance of the trials. Table 3 shows that the performance on various in-context learning tasks increases monotonically as the model size increases. However, the in-context learning ability on Ko→En translation and KLUE-STS is much lower than the baselines. For translation in particular, we conjecture that the poor Ko→En performance might result from the low ratio of English in our corpus. More sophisticated prompt engineering might also improve the results, which is a direction for future research.

Table 4 shows the results of prompt-based tuning (p-tuning) (Liu et al., 2021b) on NSMC. Although in-context few-shot learning has already achieved near state-of-the-art performance on NSMC, p-tuning enables HyperCLOVA to outperform the comparison models with no parameter updates to the main model.
It is worth noting that p-tuning with only 4K examples provides results comparable to RoBERTa fine-tuned on 150K examples. Considering the results in Table 3 and Table 9 in Appendix C.1, we conjecture that p-tuning significantly enhances the robustness of HyperCLOVA as well as its accuracy.
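The in-context few-shot protocol used in these evaluations can be sketched as plain prompt construction: k labeled examples are concatenated ahead of the test input, and the model continues the text with a label. The English template, labels, and examples below are hypothetical stand-ins, not the actual Korean prompts.

```python
import random

random.seed(0)

# Hypothetical labeled pool standing in for a sentiment training set.
train = [
    ("The movie was deeply moving.", "positive"),
    ("A total waste of two hours.", "negative"),
    ("Brilliant acting and direction.", "positive"),
    ("The plot made no sense at all.", "negative"),
]

def build_prompt(examples, query):
    """Concatenate k labeled examples, then leave the query's label blank."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# One of several randomly sampled few-shot sets; reported accuracies
# would be averaged over multiple such sets.
shots = random.sample(train, k=3)
prompt = build_prompt(shots, "I would happily watch it again.")
print(prompt)
```

The LM's continuation after the final "Sentiment:" is read off as the prediction, which is why no parameter update is needed.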

Prompt-based Tuning
Furthermore, we explore the effects of p-tuning at the input side on generation tasks, with experiments on our in-house query modification task. As shown in Table 5, p-tuning enables HyperCLOVA to consistently improve input query quality by a significant margin in both zero- and three-shot scenarios. In larger models, the influence of the discrete prompt appears smaller. This result is similar to the trend discovered by Lester et al. (2021): as the scale of the LM increases, competitive performance can be obtained even without any discrete prompt. To the best of our knowledge, this is the first report of applying input-side p-tuning to generation tasks with an in-context LM learner.
These results also imply that when the backward gradients of a GPT-3-scale model with respect to input data are accessible, prompt optimization methods are feasible alternatives for enhancing the representation power of large-scale LMs, even for NLP researchers and practitioners without large-scale GPU clusters.

Effect of Tokenization
We analyze the effects of morpheme-aware byte-level BPE, our tokenization method that considers Korean linguistic characteristics. As baselines, we employ byte-level BPE and char-level BPE, two prevalent tokenization methods for pre-training LMs with English-centric corpora. Note that char-level BPE refers to the original BPE. It yields out-of-vocabulary (OOV) tokens: some Korean characters, like '젝', are not included in the char-level BPE vocabulary. The other two tokenization strategies do not produce OOV tokens. We use models with 1.3B parameters, a relatively small size, considering the heavy computation time of pre-training. Nevertheless, this is enough to find evidence of tokenization effects.
As shown in Table 6, our method improves the performance on most tasks compared to the baselines. However, on the Ko→En task, the morpheme analyzer makes performance worse. On the other hand, char-level BPE performs much worse than byte-level BPE on YNAT. This is because char-level BPE produces OOV tokens, which make some important words in YNAT headlines hard to understand. For example, the character '젝' (jec) in the word '프로젝트' (project) is an OOV token in char-level BPE, which makes test headlines containing '프로젝트' incomprehensible. Overall, it is worth noting that carefully designed language-specific tokenization is essential for training large-scale LMs for languages whose linguistic properties differ substantially from English.
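The OOV contrast between the two baselines can be checked with a toy example; the tiny "training corpus" below is hypothetical.

```python
# Char-level vocabulary built from a (toy, hypothetical) training corpus.
corpus = "프로그램 개발 회의"
char_vocab = set(corpus)

word = "프로젝트"  # "project"
oov = [ch for ch in word if ch not in char_vocab]
print("char-level OOV:", oov)  # '젝' (among others) is missing from the vocab

# A byte-level vocabulary covers all 256 byte values, so no input
# can ever fall outside it.
assert all(0 <= b < 256 for b in word.encode("utf-8"))
print("byte-level OOV: none")
```

A character unseen at vocabulary-building time is unrepresentable for char-level BPE, while byte-level BPE can always fall back to raw bytes.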

Discussion on Industrial Impacts
What changes can large-scale LMs bring? We claim that "accelerating the life-cycle of NLP ML operations" is one possible answer. Unlike the protocol of most deep learning research, where a model is trained by ML experts with a well-collected dataset and a well-defined objective function, making an AI product in a production-level pipeline involves several additional steps, which incur tremendous communication overhead and costs. A platform built on large-scale LMs can make huge progress by allowing a single non-developer, such as a service designer, to build a prototype system.
Section 5.1 introduces HyperCLOVA Studio as our distribution method for HyperCLOVA. Section 5.2 introduces three in-house usages of HyperCLOVA Studio. Section 5.3 discusses possible extensions of HyperCLOVA Studio: prompt-based optimization, an input module, and an output module. Using the evidence above, Section 5.4 discusses the No/Low Code AI paradigm.

HyperCLOVA Studio
HyperCLOVA Studio is the place for building and sharing the artifacts generated by HyperCLOVA. It serves two functions: 1) providing a GUI interface, like the OpenAI Playground, and 2) supporting an API endpoint through which outputs can be easily acquired by an API call, with diverse functions, including ones not yet provided by the OpenAI Playground. These advanced functions are specified in Section 5.3. Figure 3 in Appendix D shows our GUI interface. The biggest advantage of HyperCLOVA Studio is that it allows rapid prototyping of AI-based services while minimizing the involvement of ML engineers.

Case Studies on HyperCLOVA Studio
This section shares three in-house applications powered by HyperCLOVA Studio, which are novel tasks for a large-scale LM, as illustrated in Figure 1. The three in-house usages share the three properties below. First, it is non-trivial to define the objective function or to evaluate the models automatically. Second, the style of the inputs and outputs is easily controlled. Lastly, a product designer without programming skills or knowledge of AI can easily make proof-of-concept (PoC) systems within a few hours.

Rapidly Prototyping Chatbots with Personalities
This subsection discusses rapid prototyping of chatbots with personalities (Smestad and Volden, 2018) using HyperCLOVA. Our chatbot designers found that HyperCLOVA allows them to build a chatbot with the persona of a specific character using one or two lines describing the character's properties and a few dialog examples. This process can be used for producing many bots in metaverse applications. Figure 1 (a) shows an example.
The style of the character can be controlled easily by changing a few dialog examples in the prompt. Knowledge in HyperCLOVA can also be implicitly extracted through the beginning of the prompt. For example, knowledge of famous figures can be reflected. A detailed discussion can be found in Appendix C.4.
PoC is easily achievable, and the subsequent human-in-the-loop process can accelerate building a bot. Based on these functions, it is possible to quickly build dialogue systems with various characteristics. HyperCLOVA Studio supports these functionalities.

Zero-shot Transfer Data Augmentation
The task is to build utterances tailored to user intent. Given the natural-language name of the user's intent, corresponding utterances are generated. For example, given "reservation query with one person" as the user intent name, HyperCLOVA outputs sentences like "Is it OK for reservation with one person?" We formulate this problem as in-context zero-shot transfer data augmentation: we give source-domain classes and their corresponding utterances in the prompt, and generate utterances for a target intent from its name alone. The name of an intent can be simple, like "reservation inquiry", or complex, like "complaints about the degree of steak doneness". In in-house usage, a team managing product quality uses this function to make diverse utterances for validating the dialog system. The team reported that they could easily create diverse utterances for intents with complicated situations using HyperCLOVA Studio.
We design a simple experiment to obtain quantitative results. We select 20 classes from an in-house intent corpus as the target domain and 6 classes with 5 examples each as the source domain. Quantitative results using the 39B model are shown in Table 7. See the details and discussion in Appendix C.5.
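A zero-shot transfer data augmentation prompt of the kind described above might look roughly like the following sketch; the intents and utterances are hypothetical stand-ins for the in-house corpus.

```python
# Hypothetical source-domain intents with example utterances; the target
# intent is described only by its name (zero-shot for the target domain).
source = {
    "opening hours query": ["What time do you open?", "Are you open on Sundays?"],
    "parking query": ["Is there a parking lot?", "How much is parking per hour?"],
}
target_intent = "reservation query with one person"

def build_augmentation_prompt(source, target_intent):
    lines = []
    for intent, utterances in source.items():
        lines.append(f"Intent: {intent}")
        lines += [f"- {u}" for u in utterances]
        lines.append("")
    lines.append(f"Intent: {target_intent}")
    lines.append("-")  # the LM continues here with generated utterances
    return "\n".join(lines)

print(build_augmentation_prompt(source, target_intent))
```

The model is expected to continue the final bullet with new utterances matching the target intent name, which is what makes the setup zero-shot for the target domain.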

Event Title Generation
Event title generation produces the titles of events for enhancing product advertisements on our e-commerce platforms. Similar to the significant effect of product titles on CTR and revenue (Zhang et al., 2019), the product event title has a crucial influence on the product's success. We formulate event title generation as a sequence-to-sequence task that transforms keywords describing the product characteristics into an impressive event title.
To achieve this, we ask an event designer to prepare five examples, including event dates and keywords, as a prompt to HyperCLOVA. With less than 10 minutes of the designer's effort, HyperCLOVA Studio was able to generate high-quality candidate sales event titles. The baseline mT5 model has 580M parameters and is fine-tuned with 400K training examples. For human evaluation, we asked nine human experts to pick the best expression among the titles from the ground truth (GT), mT5, and HyperCLOVA. As shown in Table 8, HyperCLOVA can yield high-quality titles comparable to GT. Interestingly, we find that higher BLEU scores with respect to GT do not guarantee higher quality (Mathur et al., 2020). On the contrary, it is worth noting that the lower BLEU of HyperCLOVA implies that it can generate more creative titles that do not use the exact words of the GTs yet match their quality. Our system also makes it easy to control the theme each designer wants to emphasize for the same keywords, such as discount promotions, item brands, and product values. Detailed results are presented in Appendix C.6. Unlike fine-tuned models, HyperCLOVA is easily adapted to events in other domains by modifying the prompts. We also share usage for the advertisement headline task in Appendix C.6, where few training examples are available, but a prompt similar to the event title generation task achieves 99% appropriateness for the real service.

Opportunity of HyperCLOVA Studio
HyperCLOVA Studio can boost the ability of HyperCLOVA with multiple additional AI functions. First, an input gradient API, which returns input gradients of HyperCLOVA, can be applied to enhance performance on local downstream tasks. Even for downstream tasks on which the in-context learner performs well, prompt-based optimization can further boost performance; Section 4.3 shows this possibility. Our studio can be extended to supply an input gradient function that supports prompt tuning on local machines. Each developer can then train their own prompt encoder using prompt-optimization methods such as AutoPrompt (Shin et al., 2020), p-tuning (Liu et al., 2021b), or prompt tuning (Lester et al., 2021).
Second, a prompt injection module can be applied. HyperCLOVA can be used as an open-domain QA reader by consuming adequate documents retrieved by a retriever. In general, retrieving knowledge or similar examples can boost the performance of HyperCLOVA.
Finally, filters for input and output are helpful for preventing misuse of HyperCLOVA. The OpenAI API also provides a filter to monitor the generation of sensitive or ethically inadequate sentences.
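If an input gradient API of the kind described in the first extension were available, client-side prompt optimization could look roughly like the following sketch, where a local finite-difference routine stands in for the gradients a remote API would return. Every function name and the toy loss here are hypothetical; this is not an actual HyperCLOVA Studio API.

```python
# Hypothetical stand-in for a remote model's loss; a real input gradient API
# would return gradients of the LM loss with respect to a continuous prompt.
def remote_loss(prompt_vec):
    optimum = [0.5, -1.0, 2.0]  # unknown to the client in a real setting
    return sum((p - o) ** 2 for p, o in zip(prompt_vec, optimum))

def input_gradient(prompt_vec, eps=1e-5):
    """Finite-difference stand-in for an input-gradient API call."""
    grads = []
    for i in range(len(prompt_vec)):
        bumped = list(prompt_vec)
        bumped[i] += eps
        grads.append((remote_loss(bumped) - remote_loss(prompt_vec)) / eps)
    return grads

# The client updates only its local prompt vector using returned gradients.
prompt = [0.0, 0.0, 0.0]
for _ in range(200):
    g = input_gradient(prompt)
    prompt = [p - 0.1 * gi for p, gi in zip(prompt, g)]
print([round(p, 2) for p in prompt])  # approaches the optimum
```

The point of the design is the division of labor: gradients are computed where the large model lives, while the small prompt parameters live and update on the client.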

No/Low Code AI Paradigm
A typical machine learning development pipeline involves (1) problem definition and user research, (2) data gathering and annotation, (3) training and validating models, (4) deploying and operating machine learning systems (MLOps), and (5) error analysis and user monitoring. It is an iterative process in which any issue in one step propagates to the others, and the need to revisit steps for revision and update constantly arises even after model deployment. This is especially tedious and resource-heavy, not only because the pipeline involves different expertise and roles, but also because there is no shared grounded artifact to facilitate communication between the experts.
A single large-scale LM with a GUI interfacing over prompts, like HyperCLOVA Studio, can remarkably alleviate this problem. Specifically, steps (2) through (4) of the five processes above can be combined into one step. In this unified phase, curating examples, prompt design, API parameter tuning, and API integration can take place at once.
Notably, an approach built on a single large-scale LM dramatically reduces communication costs among experts. Through this, the prototype of a desired AI product can be created within a few hours. Though many companies want to use AI technology, it is costly for companies and teams to adopt AI techniques and gather data for AI. Therefore, there have been several discussions about strategies for adopting AI technology (Raffel et al., 2020). An approach with a single large-scale LM provides a novel paradigm to research communities and industries.
The No Code AI approach is powerful when fast iteration on PoC is beneficial, or when services can be built solely with the pure generation ability of a large-scale model. The Low Code AI approach can be used when some training data (Liu et al., 2021a), pre-processing code, or input/output modules are required.
We discuss the challenges of achieving the No/Low Code AI paradigm with large-scale LMs in detail in Section F of the Appendix.

Conclusion
We present HyperCLOVA, billions-scale Korean-centric LMs of various sizes. In particular, HyperCLOVA with 82B parameters shows state-of-the-art in-context zero-shot and few-shot performance and can be further boosted by prompt-based learning methods. We will share our model through HyperCLOVA Studio, where non-developers can easily build their own AI-backed products. We argue that a framework like HyperCLOVA Studio can potentially realize the No Code AI paradigm, and we hope that cases of this paradigm become popular, although opportunities and challenges coexist.
Our goal is to create an ecosystem around HyperCLOVA Studio in Korea and help people unfamiliar with machine learning make their own AI models.

Broader Impact Statement
Since GPT-3 was released, the NLP and AI communities have been impressed by the capability of its variants, which remarkably outperform previous work.
Despite their great success, these hyperscale pretrained LMs raise several severe concerns that may harm the sustainability of AI and society.
Misuse of large-scale LMs: The case of Tay, the chatbot developed by Microsoft in 2016, is one of the most well-known examples of misuse. Recently, Luda, a Korean chatbot developed by a Korean startup, suffered serious sexual abuse by malicious users. This brought to the surface the fundamental social problem of whether AI can be a target of abuse. In the Luda service, privacy issues were even more critical from a legal perspective, caused by incomplete data preprocessing for privacy preservation. In addition to private information, hate speech data can lead to malicious misuse of language models when used as training data. Several GPT-3 API applications have also reported such malicious usages and problematic generation results.
Fairness, Bias, and Representation: Another critical problem with Luda was biased and repulsive responses regarding various sensitive social values, including gender and race. Many studies have reported that such biases in training data significantly influence large-scale language models as well (Abid et al., 2021; Garrido-Muñoz et al., 2021; Shwartz and Choi, 2020). To overcome these issues, many researchers argue for the necessity of controllability when generating sentences, such as filtering, and investigate how to refine data more effectively for debiasing (Tamkin et al., 2021).
Excessive Energy Consumption: Many researchers have serious concerns about the heavy energy consumption of training large-scale models, which has recently been reported by several analysis papers (Patterson et al., 2021; Bender et al., 2021). Scaling laws suggest that more parameters and more training data are essential for better performance, which inevitably worsens the energy issue. A plausible alternative is to use energy-efficient hardware such as FPGAs.
Efforts for Positive Directions: Despite all these concerns and side effects, large-scale LMs can provide significant and innovative benefits that cannot be expected from previous AI technologies. One of the most valuable capabilities of large-scale LMs is the possibility of No/Low Code AI. Despite many open-source AI libraries, developing AI systems and models of a certain quality still requires considerable effort, experience, and corresponding data, which form an entry barrier to AI democratization. No/Low Code AI, however, allows industrial engineers and online service designers unfamiliar with machine learning to build a simple AI system or prototype rapidly. This contribution is analogous to the success of office suites such as Microsoft Office. We provided HyperCLOVA Studio to our platform service designers, who achieved surprising results and performances using the Studio with their creativity. The outputs and data generated by HyperCLOVA Studio are applied to our AI services. From these results, we see the possibility of No/Low Code AI with HyperCLOVA, which is a meaningful step toward AI democratization. Therefore, strong efforts are needed to alleviate the problematic issues while benefiting from the values that large-scale LMs can provide.

A.1 Data Description
As shown in Table 1, 49%, 15%, and 13% of the corpus come from blogs, community sites, and news, respectively. 7% of the corpus consists of comments from the various websites mentioned above. 5% of the corpus comes from KiN 11 , an online social QnA service similar to Quora, which consists of open questions and answers written by users. Note that our corpus also includes Korean Wikipedia, but its portion is very small (0.04%). We also use English and Japanese Wikipedia to enhance the model's ability in foreign languages. Modu-corpus 12 is a collection of datasets compiled by the National Institute of Korean Language (NIKL). We use five datasets from Modu-corpus: messenger, news, spoken language, written language, and web corpora. The data ratio per language is 97%, 2%, 0.5%, and 0.5% for Korean, English, Japanese, and other languages, respectively.

A.2 Data Cleaning
In a similar way to Brown et al. (2020), we train a logistic regression model that measures the quality of each document, using the document's BERT feature as input. We treat high-quality encyclopedia documents as positive examples and crawled web documents as negative ones, and we exclude documents predicted to be low-quality. To remove duplicated documents, we compute document similarity with a hash function. We also use an in-house spam filtering technique to remove undesired advertisements and documents. Moreover, we exclude low-quality documents that are too short or too repetitive at the level of graphemes, numbers, or special characters. In particular, we observe that review-type documents often contain highly repetitive expressions because of a minimum-length policy for writing reviews. Documents containing too many swear words and slang are also excluded. Within a document, we remove sentences duplicated between the title and the content. For the KiN corpus, if multiple answers are registered for one question, we use only the answers adopted by the questioner or written by certified experts, such as doctors or lawyers. Even an adopted answer is excluded if its author's reputation score is low. We parse the HTML source code and use only the meaningful parts of the HTML page for training the model. For news-type documents, we remove typical parts that carry insignificant information, such as the first line and the final affiliation phrase.
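The deduplication and heuristic filtering steps above can be sketched as follows. This is a minimal, self-contained illustration: the hash function, thresholds, and helper names are our own assumptions, and the real pipeline additionally uses the BERT-feature quality classifier and in-house spam filter described above rather than these simple heuristics.

```python
import hashlib
import re

def doc_fingerprint(text):
    # Normalize whitespace and case, then hash, so near-verbatim
    # duplicates map to the same fingerprint.
    norm = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

def is_low_quality(text, min_chars=50, max_symbol_ratio=0.3):
    # Drop documents that are too short.
    if len(text) < min_chars:
        return True
    # Drop documents dominated by numbers or special characters.
    symbols = sum(1 for ch in text if not ch.isalpha() and not ch.isspace())
    return symbols / len(text) > max_symbol_ratio

def clean_corpus(docs):
    seen, kept = set(), []
    for doc in docs:
        if is_low_quality(doc):
            continue
        fp = doc_fingerprint(doc)
        if fp in seen:  # hash-based deduplication
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

In practice the repetitiveness checks operate at the grapheme level as well, and thresholds are tuned per document type.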

A.3 Data Anonymization
We mask personal information such as resident registration numbers, email addresses, phone numbers, bank account numbers, credit card numbers, passport numbers, and driver's license numbers. However, we retain non-critical parts of these numbers that cannot be used to identify a person. For example, we extract age and gender from a resident registration number, location information from a driver's license number, the dialing code from a phone number, and the domain from an email address.
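The partial-masking idea can be sketched as below. The patterns and placeholder tokens here are illustrative assumptions (the actual pipeline covers many more Korean ID formats); the point is that the identifying part is masked while the non-critical part, such as an email domain or a dialing code, is kept.

```python
import re

# Illustrative patterns only; real coverage includes resident registration,
# passport, bank account, and credit card numbers, among others.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
PHONE = re.compile(r"\b(\d{2,3})-\d{3,4}-\d{4}\b")

def anonymize(text):
    # Mask the local part of an email address but keep the domain.
    text = EMAIL.sub(r"<EMAIL>@\2", text)
    # Mask a phone number but keep the dialing code.
    text = PHONE.sub(r"\1-<PHONE>", text)
    return text
```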

C Details on Experiments
C.1 NSMC
Table 9 shows the statistics on the performance of HyperCLOVA on NSMC.

C.2 AI Hub Translation
Table 10 shows the statistics on the performance of HyperCLOVA on the AI Hub translation tasks.

C.3 Query Modification Task
Table 11 and Table 12 show the example and the prompt for the query modification task.

C.4 Discussions on Making Persona Chatbot
Recent neural chit-chat models such as Meena and Blender show impressive conversational performance (Humeau et al., 2020; Adiwardana et al., 2020; Roller et al., 2020). However, such conversation systems require a large amount of data and cannot be adapted to a new conversational style in an instant. There is also plenty of research on style transfer, but these methods do not control the detailed style of the conversational system (Smith et al., 2020). Hallucination issues also exist; retrieved knowledge can alleviate this problem (Shuster et al., 2021). A pre-trained reader can also benefit if the pre-trained LM itself also per-

C.5 Zero-shot Transfer Data Augmentation
HyperCLOVA does not always generate sentences that fit the target intent class. However, even when people write utterances for an intent themselves, it is difficult to create diverse patterns, and data collectors struggle to produce many utterances for this reason. Data collectors can easily build a corpus by selecting sentence candidates created by HyperCLOVA. Our corpus designer also found that generating dialect, or converting standard language to dialect, is easily achievable, showing the capability of data augmentation with HyperCLOVA.
Note that this experiment is zero-shot transfer data augmentation, where examples from classes different from the target classes are used as in-context examples. We use a total of 30 examples from six source classes, and randomly sample three source classes and their corresponding 15 examples to put into the prompt. For classification, an in-house BERT-based model is used.
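The prompt construction described above can be sketched as follows. The function name, the prompt template wording, and the class labels are assumptions for illustration; the key idea is that only examples from *source* classes appear in the prompt, and the LM is asked to complete an utterance for an unseen target class.

```python
import random

def build_augmentation_prompt(source_examples, target_class,
                              n_classes=3, per_class=5):
    # source_examples: dict mapping source class name -> list of utterances.
    # Randomly sample a few source classes, splice their examples into the
    # prompt, then ask the LM to complete an utterance for the target class.
    classes = random.sample(list(source_examples),
                            k=min(n_classes, len(source_examples)))
    lines = []
    for cls in classes:
        for utt in source_examples[cls][:per_class]:
            lines.append(f"Intent: {cls}\nUtterance: {utt}")
    # The target class appears only here, with no example utterance.
    lines.append(f"Intent: {target_class}\nUtterance:")
    return "\n\n".join(lines)
```

With six source classes of five examples each, sampling three classes yields the 15 in-context examples used in the experiment.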
Similar concurrent work is conducted by Schick and Schütze (2021). However, their method is only applicable to NLI, which is a well-defined task with good datasets and pre-trained models.

Table 14 and Table 18 show the example prompts for the event title generation task. Table 17 shows a qualitative comparison between mT5 and our model. The product designer also performs the advertisement headline generation task in a similar way to the event title generation task. In this task, no training data is available due to data privacy issues. Nevertheless, HyperCLOVA successfully generates advertisement headlines with a prompt in a style similar to that of the event title generation task. Table 15 shows the prompt. Three different prompts are used for advertisement headline generation, and the generated sentence most similar to the product name, which is the input of the task, is selected. The similarity score is the cosine similarity computed on features from the in-house BERT. The product designer evaluates that 99% of the generated sentences are appropriate for the real service.

Figure 3 shows the GUI of HyperCLOVA Studio. Figure 4 illustrates the No Code AI paradigm in HyperCLOVA Studio. Figure 5 shows the motivation for and importance of our morpheme-aware tokenization. Though we used an in-house morpheme analyzer, an alternative open-source morpheme analyzer such as Mecab-ko 14 can also be used.
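The candidate-selection step for headline generation (picking the generated sentence closest to the product name) can be sketched as below. The character-bigram embedding here is only a self-contained stand-in for the in-house BERT feature extractor; the cosine-similarity selection logic is the part that mirrors the procedure above.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for the in-house BERT feature: a character-bigram
    # count vector, used purely so the sketch runs without a model.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_headline(candidates, product_name):
    # Among candidates generated from the different prompts, keep the
    # one most similar to the product name in feature space.
    return max(candidates, key=lambda c: cosine(embed(c), embed(product_name)))
```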

F Challenges of No/Low Code AI Paradigm
Some researchers doubt whether GPT-3 is competitive with existing finetuning-based LMs on various downstream tasks. For example, a task-specific neural architecture like FiD (Izacard and Grave, 2020) achieves state-of-the-art open-domain QA, whereas GPT-3 does not. Whether prompt-based methods can make large-scale LMs competitive remains under-explored; further investigation of the general capabilities of large models and of prompt-based optimization is required. There is also a dependency on pre-training data. If the corpus does not contain code, it is unfair to expect the LM to generate source code, even when prompt-based optimization is applied. The maintainer of HyperCLOVA Studio may discover many user requirements and further train the model on corpora covering common needs. To incorporate such corpora, research on pre-training under a continual learning setup (Bang et al., 2021) is required.
Though we mentioned No Code AI earlier, programming beyond the functions of HyperCLOVA Studio is still needed for the remaining parts of a complete AI system. Also, knowledge of ML is still implicitly required to design an effective prompt and few-shot examples. An easier guideline for the Studio and incentives for sharing users' own prompts can help the ecosystem spread.
In order to support full-fledged ML development, we also need additional features for HyperCLOVA Studio: experimentation and user feedback. With these features, a user can easily deploy a PoC service through an appropriate interface, such as a text editor or messenger, and collect user feedback on HyperCLOVA's responses. For example, a user can rate the responses of a chatbot turn by turn.

Figure 5: Motivation of our morpheme-aware byte-level BPE tokenization. (Top) A conceptual example of making subwords with three tokenization methods. (Middle) An example of tokenization, where subwords from the byte-level tokenizer are represented as bytes. (Bottom) The same example as (middle), but subwords from the byte-level tokenizer are represented as characters.
Expensive inference and prompt-based optimization costs are still an obstacle to using large-scale LMs. However, there is a cost trade-off between training many small-scale LMs and running inference with one large-scale LM. The outputs of one large-scale LM can also serve as inputs to small-scale LMs (Yoo et al., 2021). Research on distilling generative transformers and on energy-efficient hardware is essential for sustainability. Further discussion of several issues appears in the Broader Impact Statement section.