Can ChatGPT Assess Human Personalities? A General Evaluation Framework

Large Language Models (LLMs), especially ChatGPT, have produced impressive results in various areas, but their potential human-like psychology remains largely unexplored. Existing works study the virtual personalities of LLMs but rarely explore the possibility of analyzing human personalities via LLMs. This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests. Specifically, we first devise unbiased prompts by randomly permuting options in MBTI questions and adopt the average testing result to encourage more impartial answer generation. Then, we propose to replace the subject in question statements to enable flexible queries and assessments of different subjects from LLMs. Finally, we re-formulate the question instructions as correctness evaluations to facilitate clearer responses from LLMs. The proposed framework enables LLMs to flexibly assess the personalities of different groups of people. We further propose three evaluation metrics to measure the consistency, robustness, and fairness of assessment results from state-of-the-art LLMs including ChatGPT and GPT-4. Our experiments reveal ChatGPT's ability to assess human personalities, and the average results demonstrate that it can achieve more consistent and fairer assessments, in spite of lower robustness against prompt biases, compared with InstructGPT.


Introduction
Pre-trained Large Language Models (LLMs) have been widely used in many applications such as translation, question answering, storytelling, and chatbots (Menick et al., 2022; Yuan et al., 2022; Nichols et al., 2020; Brown et al., 2020a; Ouyang et al., 2022; Chung et al., 2022). ChatGPT (Ouyang et al., 2022) is currently recognized as one of the most capable chatbots. It is able to perform context-aware conversations, challenge incorrect premises, and reject inappropriate requests based on a vast knowledge base and human-centered fine-tuning. These advantages make ChatGPT well-suited for a variety of real-world scenarios, including business consultation, engineering, education, and research services (Zhai, 2022; van Dis et al., 2023).
Recent studies have revealed that LLMs may possess human-like self-improvement and reasoning characteristics (Huang et al., 2022). The latest ChatGPT can pass 93% of Theory of Mind (ToM) tasks with strong analysis and decision-making capabilities (Kosinski, 2023; Zhuo et al., 2023). In this context, LLMs are increasingly assumed to have virtual personalities and psychologies, which play an essential role in guiding their responses and interaction patterns (Jiang et al., 2022). Based on this assumption, a few works (Li et al., 2022; Jiang et al., 2022; Karra et al., 2022; Caron and Srivastava, 2022) apply psychological tests such as the Big Five Factors (Digman, 1990) to evaluate their pseudo-personalities (e.g., behavior tendencies), so as to detect societal and ethical risks (e.g., racial biases) in their applications.
Although existing works have investigated the personality traits of LLMs, they rarely explore whether LLMs can assess human personalities. This open problem can be the key to verifying the ability of LLMs to perform psychological (e.g., personality psychology) analyses and revealing their potential understanding of humans, i.e., "How do LLMs think about humans?". Specifically, assessing human personalities from the perspective of LLMs (1) enables us to probe LLMs' perception of humans to better understand their potential response motivations, thinking modes, and communication patterns (Jiang et al., 2020); (2) helps reveal whether LLMs hold biases about people, so that we can optimize them (e.g., add stricter rules) to generate fairer content; and (3) helps uncover potential ethical and social risks (e.g., misinformation) of LLMs (Weidinger et al., 2021) that can affect their reliability and safety, thereby facilitating the development of more trustworthy and human-friendly LLMs.
To this end, we introduce the novel idea of letting LLMs assess human personalities, and propose a general evaluation framework to acquire quantitative human personality assessments (e.g., personality types and tendencies) from LLMs via the Myers-Briggs Type Indicator (MBTI) (Myers and McCaulley, 1985). Specifically, our framework consists of three key components: (1) Unbiased prompts, which construct instructions of MBTI questions using randomly-permuted options and average the testing results to achieve more consistent and impartial answers; (2) Subject-replaced query, which converts the original subject of the question statements into a target subject to enable flexible queries and assessments from LLMs; (3) Correctness-evaluated instruction, which reformulates the question instructions so that LLMs analyze the correctness of the question statements, yielding clearer responses. Based on the above components, the proposed framework reformulates the instructions and statements of MBTI questions in a flexible and analyzable way for LLMs, which enables us to query them about human personalities. Furthermore, we propose three quantitative evaluation metrics to measure the consistency of LLMs' assessments of the same subject, their robustness against random perturbations of input prompts (defined as "prompt biases"), and their fairness in assessing subjects with different genders. In our work, we mainly focus on evaluating two representative state-of-the-art LLMs (i.e., ChatGPT and InstructGPT (Ouyang et al., 2022)) based on the proposed metrics. Experimental results showcase the ability of ChatGPT to analyze the personalities of different groups of people, which can provide valuable insights for future exploration of LLM psychology, sociology, and governance.
Our contributions can be summarized as follows: • We for the first time explore the possibility of assessing human personalities by LLMs, and propose a general framework for LLMs to conduct quantitative evaluations via MBTI.
• We devise unbiased prompts, subject-replaced queries, and correctness-evaluated instructions to encourage LLMs to perform a more reliable and flexible assessment of human personalities.
• We propose three evaluation metrics to quantitatively analyze the consistency, robustness, and fairness of LLMs in assessing human personalities.
• Our experiments show that both ChatGPT and InstructGPT can independently assess human personalities. The average results demonstrate that ChatGPT achieves more consistent and fairer assessments with less gender bias compared with InstructGPT, while its results are more sensitive to prompt biases.
2 Related Works

Personality Measurement
The commonly-used personality modeling schemes include the three trait personality measure (Eysenck, 2012), the Big Five personality trait measure (Digman, 1990), the Myers-Briggs Type Indicator (MBTI) (Myers, 1962), and the 16 Personality Factor questionnaire (16PF) (Schuerger, 2000). Three main factors (Extraversion, Neuroticism, and Psychoticism) are considered in the three trait personality measure (Eysenck, 2012) to identify individual differences that are well grounded in biological processes. Five dimensions (Extraversion, Neuroticism, Agreeableness, Conscientiousness, and Openness) are defined in the Big Five personality trait measure (Digman, 1990) to classify major sources of individual differences and analyze a person's characteristics. The MBTI (Myers, 1962; Myers and McCaulley, 1985) is one of the most widely adopted personality measures. It classifies people into two groups in each of the four dimensions: Introversion versus Extroversion; Sensing versus Intuiting; Thinking versus Feeling; and Judging versus Perceiving. Based on the Big Five trait theory, Karra et al. (2022) and Caron and Srivastava (2022) quantify the personality traits of language models, while Jiang et al. (2022) further develop the machine personality inventory (MPI) to standardize this evaluation. In Li et al. (2022), multiple psychological tests such as the Short Dark Triad (SD-3) (Jones and Paulhus, 2014) are combined to analyze the LLMs' safety. In contrast to previous studies that evaluate the personalities of LLMs, our work is the first to explore human personality analysis via LLMs. The goal is to study potential opinions or biases of LLMs about humans in order to better analyze their reliability and fairness.

Biases in Language Models
Most recent language models are pre-trained on large-scale text from existing datasets or the Internet, which usually contains unsafe (e.g., toxic) content and may cause the model to generate biased answers that violate prevailing societal values (Bolukbasi et al., 2016; Sheng et al., 2019; Bordia and Bowman, 2019; Nadeem et al., 2021; Zong and Krishnamachari, 2022; Zhuo et al., 2023). Bolukbasi et al. (2016) show that biases in the geometry of word embeddings can reflect gender stereotypes. The gender bias in word-level language models is quantitatively evaluated in (Bordia and Bowman, 2019). In (Nadeem et al., 2021), the authors demonstrate that popular LLMs such as BERT (Kenton and Toutanova, 2019), GPT-2 (Radford et al., 2019), and XLNet (Yang et al., 2019) possess strong stereotypical biases on gender, profession, race, and religion. To reduce such biases, many state-of-the-art LLMs such as FLAN-T5 (Chung et al., 2022), ChatGPT, and its sibling model InstructGPT (Ouyang et al., 2022) apply instruction fine-tuning with non-toxic corpora and instructions to improve their safety. Zhuo et al. (2023) reveal that ChatGPT can generate socially safe responses with fewer biases than other LLMs, although it might generate more biased answers in non-English language settings.
In this work, instead of directly analyzing the biases of LLMs from their responses, our framework enables us to evaluate whether LLMs possess biased perceptions and assessments on humans (e.g., personalities), which helps us better understand the underlying reasons for the LLMs' aberrant responses.

Unbiased Prompt Design
LLMs are typically sensitive to prompt biases (e.g., varying word orders), which can significantly influence the coherence and accuracy of the generated responses especially when dealing with long text sequences (Zhao et al., 2021;Li et al., 2022). To encourage more consistent and impartial answers, we propose to design unbiased prompts for the input questions. In particular, for each question in an independent testing (i.e., MBTI questionnaire), we randomly permute all available options (e.g., agree, disagree) in its instruction while not changing the question statement, and adopt the average results of multiple independent testings as the final result.
Formally, the instruction and statement for the i-th question are defined as I_i and S_i, where i ∈ {1, ..., n} and n is the total number of questions in the testing. The instruction contains m available options O_I = {o_1, o_2, ..., o_m}, which correspond to the seven levels {Agree, Generally agree, Partially agree, Neither agree nor disagree, Partially disagree, Generally disagree, Disagree} (i.e., m = 7) from agreement to disagreement in the MBTI questionnaire. We use Ω(O_I) to denote all possible permutations of the options in O_I, and a random permutation can be represented as O_R = (o_{r_1}, o_{r_2}, ..., o_{r_m}) ∈ Ω(O_I), where r_i ∈ {1, 2, ..., m} and o_{r_i} = o_{r_j} iff i = j. Then, we utilize the randomly permuted options O_R to construct the instruction I_i for the i-th question: Instruction: Do you o_{r_1}, o_{r_2}, ..., or o_{r_m} with the following statement. Why?
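As a minimal illustration, the option permutation above can be sketched in Python (the prompt wording follows the example in this section; the helper name is our own):

```python
import random

# The seven agreement levels of the MBTI questionnaire (m = 7).
OPTIONS = [
    "agree", "generally agree", "partially agree",
    "neither agree nor disagree",
    "partially disagree", "generally disagree", "disagree",
]

def build_unbiased_instruction(options, rng=random):
    """Draw one random permutation O_R from Omega(O_I) and embed it
    in the question instruction."""
    permuted = rng.sample(options, k=len(options))
    listed = ", ".join(permuted[:-1]) + " or " + permuted[-1]
    return f"Do you {listed} with the following statement. Why?"
```

Averaging the results of several independent testings, each built with a fresh permutation, is what mitigates the order bias.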
We combine the above instruction and the question statement as the prompt to query LLMs. An example prompt for a question in the MBTI questionnaire is provided as follows.
Instruction: Do you disagree, agree, generally disagree, generally agree, neither agree nor disagree, partially disagree or partially agree with the following statement. Why?
Statement: You usually stay calm, even under a lot of pressure.
Answer: As an AI language model, I do not have personal experiences or emotions, so I cannot agree or disagree with the statement. However, I can provide some general information on the topic. Some individuals are naturally more composed under pressure, while others may become easily overwhelmed ......
However, such a query, conducted in a self-testing manner, can only elicit neutral answers as shown above, since LLMs such as ChatGPT are trained not to possess personal thinking (e.g., emotions). This motivates us to propose the subject-replaced query and correctness-evaluated instruction as illustrated below.

Subject-Replaced Query
As our goal is to let LLMs analyze human personalities instead of querying themselves (i.e., self-reporting), we propose the subject-replaced query (SRQ), which converts the original subject (i.e., "You") of each question into a specific subject-of-interest. For example, when we hope to let LLMs assess the general personality of men, we can replace the subject "You" with "Men", and correspondingly change the pronoun "your" to "their" (see the example below).

Original Statement: You spend a lot of your free time exploring various random topics that pique your interest.
Statement: Men spend a lot of their free time exploring various random topics that pique their interests.
In this way, we can request the LLMs to analyze and infer the choices/answers of a specific subject, so as to query LLMs about the personality of such a subject based on a certain personality measure (e.g., MBTI). The proposed SRQ is general and scalable: by simply changing the subject in the test, we can convert the original self-report questionnaire into an analysis of any expected subject from the perspective of LLMs.

In our work, we choose large groups of people (e.g., "Men", "Barbers") instead of specific persons as the assessed subjects. First, as our framework only uses the subject name without extra personal information to construct MBTI queries, it is unrealistic to let LLMs assess the MBTI answers or personality of a specific person, even a celebrity (see Sec. 6). Second, the selected subjects are common in the knowledge base of LLMs and can test the basic personality assessment ability of LLMs, which is the main focus of our work. Moreover, subjects with different professions such as "Barbers" are frequently used to measure the bias in LLMs (Nadeem et al., 2021), and we select such representative professions to better evaluate the consistency, robustness, and fairness of LLMs (see Sec. 3.5).
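A minimal sketch of the SRQ conversion, assuming the simple "You"/"your" rewriting rule described above (real statements may need further grammatical fixes, and the helper name is our own):

```python
import re

def subject_replaced(statement, subject, pronoun="their"):
    """Convert a self-report statement into a subject-replaced one:
    swap the subject 'You' for a subject-of-interest and the
    possessive 'your' for the matching pronoun."""
    out = re.sub(r"\bYou\b", subject, statement)
    out = re.sub(r"\byour\b", pronoun, out, flags=re.IGNORECASE)
    return out
```

For instance, applying it with the subject "Men" to the statement above yields "Men spend a lot of their free time ...".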

Correctness-Evaluated Instruction
Directly querying LLMs about human personalities with the original instruction can be intractable, as LLMs such as ChatGPT are trained NOT to possess personal emotions, beliefs, or opinions. As shown in Fig. 1, they can only generate a neutral opinion when we query their agreement or disagreement, regardless of the subject. To solve this challenge, we propose to convert the original agreement-measured instruction (i.e., querying the degree of agreement) into a correctness-evaluated instruction (CEI) by letting LLMs evaluate the correctness of the statement in each question. Specifically, we convert the original options {Agree, Generally agree, Partially agree, Neither agree nor disagree, Partially disagree, Generally disagree, Disagree} into {Correct, Generally correct, Partially correct, Neither correct nor wrong, Partially wrong, Generally wrong, Wrong}, and then construct an unbiased prompt (see Sec. 3.1) based on the proposed CEI.

Answer (example from Fig. 1): It would be partially wrong to say that "men usually stay calm, even under a lot of pressure." While some men may be able to stay calm under pressure, others may struggle with stress and anxiety ......
As shown in Fig. 1, using CEI enables ChatGPT to provide a clearer response to the question instead of a neutral one. It should be noted that the CEI is essentially equivalent to the agreement-measured instruction and can be extended to other forms (e.g., replacing "correct" with "right"). Therefore, the proposed CEI is highly scalable and can be applied to various tests of LLMs to help acquire their analyses and assessments.
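The option conversion can be sketched as a simple substitution table (a sketch with our own helper name; longer phrases are substituted first so that, e.g., "generally agree" is not partially rewritten by the shorter "agree" rule):

```python
# Agreement-measured options mapped to correctness-evaluated ones.
AGREEMENT_TO_CORRECTNESS = {
    "agree": "correct",
    "generally agree": "generally correct",
    "partially agree": "partially correct",
    "neither agree nor disagree": "neither correct nor wrong",
    "partially disagree": "partially wrong",
    "generally disagree": "generally wrong",
    "disagree": "wrong",
}

def to_cei(instruction):
    """Rewrite an agreement-measured instruction into a CEI one,
    substituting longer option phrases before shorter ones."""
    for src in sorted(AGREEMENT_TO_CORRECTNESS, key=len, reverse=True):
        instruction = instruction.replace(src, AGREEMENT_TO_CORRECTNESS[src])
    return instruction
```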

The Entire Framework
Given the original statement S_i and instruction I_i of the i-th question, we construct the new statement S'_i based on SRQ (Sec. 3.2) and the new instruction I'_i based on CEI (Sec. 3.3), which are combined to construct the unbiased prompt P_i corresponding to the i-th question (Sec. 3.1). By querying the LLM, we obtain the answer A_i, which can be formulated as:

A_i = arg max_A M_τ(A | P_i), i ∈ {1, 2, ..., n},   (1)

where M_τ is the LLM trained with the temperature τ, M_τ(· | P_i) denotes the answer sampling distribution of the LLM conditioned on the input prompt P_i, A_i represents the most likely answer generated from that distribution, and n is the number of all questions in MBTI. We adopt the default temperature used in training InstructGPT and ChatGPT. The generated answer is further parsed with several simple rules, which ensure that it contains, or can be transformed into, an exact option. For instance, when we obtain the explicit option "generally incorrect", the parsing rules convert this answer to "generally wrong" to match the existing options. We query the LLM with the designed prompts (see Eq. (1)) in the original order of the questionnaire to get all parsed answers. Based on the complete answers, we obtain the testing result (e.g., MBTI personality scores) of a certain subject from the view of the LLM. Then, we independently repeat this process multiple times, and average all results as the final result. It is worth noting that every question is answered only once in each independent testing, so as to retain a continuous testing context and encourage the coherence of the LLM's responses.
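Putting the pieces together, one independent testing can be sketched as below, where `query_llm` is a hypothetical callable wrapping the actual LLM API, the CEI phrasing is illustrative, and the parsing rules are simplified examples:

```python
import random

CORRECTNESS_OPTIONS = [
    "correct", "generally correct", "partially correct",
    "neither correct nor wrong",
    "partially wrong", "generally wrong", "wrong",
]

# Example parsing rules: map paraphrases back to the exact options.
SYNONYMS = {"generally incorrect": "generally wrong", "incorrect": "wrong"}

def parse_answer(answer):
    """Return the exact option contained in a free-form answer."""
    text = answer.lower()
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    # Check longer options first so 'generally wrong' wins over 'wrong'.
    for opt in sorted(CORRECTNESS_OPTIONS, key=len, reverse=True):
        if opt in text:
            return opt
    return None

def run_one_testing(statements, query_llm):
    """One independent pass: build an unbiased CEI prompt per question
    in the original questionnaire order, query the model once per
    question, and parse each reply into an option."""
    answers = []
    for stmt in statements:
        permuted = random.sample(CORRECTNESS_OPTIONS, k=len(CORRECTNESS_OPTIONS))
        instruction = ("Is the following statement "
                       + ", ".join(permuted[:-1]) + " or " + permuted[-1] + "? Why?")
        reply = query_llm(f"{instruction} Statement: {stmt}")
        answers.append(parse_answer(reply))
    return answers
```

Repeating `run_one_testing` N times with fresh permutations and averaging the scored results gives the final assessment.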

Evaluation Metrics
To systematically evaluate the ability of LLMs to assess human personalities, we propose three metrics in terms of consistency, robustness, and fairness as follows.
Consistency Scores. The personality results of the same subject assessed by an LLM should be consistent. For example, when we perform different independent assessments of a specific subject via the LLM, it is desirable to achieve an identical or highly similar assessment. Therefore, we propose to use the similarity between personality scores of all independent testing results and their final result (i.e., mean scores) to compute the consistency score of assessments.
Formally, we define X_i = (x_{i,1}, x_{i,2}, ..., x_{i,k}) as the personality scores assessed by the LLM in the i-th independent testing, where x_{i,j} ∈ [0, 100] is the score of the j-th personality dimension in the i-th testing, j ∈ {1, 2, ..., k}, and k is the total number of personality dimensions. Taking the MBTI test as an example, k = 5 and the dimensions represent the extraverted, intuitive, thinking, judging, and assertive scores (detailed in Sec. 4.2). The consistency score s_c can be computed by:

s_c = α / (α + (1/N) Σ_{i=1}^{N} D_E(X_i, X̄)),   (2)

D_E(X_i, X̄) = ‖X_i − X̄‖_2,   (3)

X̄ = (1/N) Σ_{i=1}^{N} X_i.   (4)

In Eq. (2), s_c ∈ (0, 1], α is a positive constant to adjust the output magnitude, D_E(X_i, X̄) denotes the Euclidean distance (Eq. (3)) between the i-th personality score X_i and the mean score X̄ computed by Eq. (4), N is the total number of testings, and ‖·‖_2 in Eq. (3) denotes the ℓ2 norm. Here we assume that each personality dimension corresponds to a different dimension in the Euclidean space, so that the difference between two testing results can be measured by their Euclidean distance. Eq. (2) converts this distance into a similarity with a range of (0, 1], and we set α = 100 since the range of each dimension is [0, 100]. Intuitively, a smaller average Euclidean distance between all testing results and the final result (i.e., mean scores) indicates a higher consistency score s_c.

Robustness Scores. The assessments of the LLM should be robust to random perturbations of input prompts ("prompt biases") such as randomly-permuted options. Ideally, we expect the LLM to classify the same subject as the same personality, regardless of the option order in the question instruction. We compute the similarity of average testing results between using fixed-order options (i.e., the original order) and using randomly-permuted options to measure the robustness score:

s_r = α / (α + D_E(X̄, X̂)),   (5)

where X̄ and X̂ represent the average testing results when using the original fixed-order options and the randomly-permuted options, respectively. We adopt the same constant α = 100 used in Eq. (2). A larger similarity between X̄ and X̂ (i.e., a smaller distance) leads to a higher s_r, which indicates that the LLM has higher robustness against prompt biases and achieves more similar results.

Fairness Scores. The assessments of the LLM on different groups of people should be unbiased and match prevailing societal values. For example, an LLM should NOT possess stereotypical biases about people with different genders, races, and religions. When no backgrounds such as professions are specified, a fair personality assessment of general subjects such as "Men" or "Women" is supposed to be similar. Considering that races and religions are highly controversial topics that typically lack a universal evaluation standard, we only analyze the fairness of LLMs' assessments on different genders in this work. We propose to use the assessment similarity of two subjects with different genders to measure the fairness of assessments on genders. The fairness score is calculated by:

s_f = s_c^M · s_c^F · α / (α + D_E(X̄^M, X̄^F)),   (6)

where X̄^M and X̄^F represent the average testing results of male (e.g., "Men", "Boys") and female subjects (e.g., "Women", "Girls"), respectively. Here we multiply by the corresponding consistency scores s_c^M and s_c^F, since a higher assessment consistency of subjects contributes more to their inherent similarity. A larger s_f indicates that the assessments on different genders are fairer, with higher consistency and less bias.
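A minimal pure-Python sketch of the three metrics, assuming the similarity form s = α/(α + D_E) implied by Eqs. (2)-(6) (the input shapes and helper names are our own):

```python
import math

ALPHA = 100.0  # matches the [0, 100] score range of each MBTI dimension

def _dist(a, b):
    """Euclidean distance D_E between two score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _mean(X):
    """Mean score vector over N independent testings."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def consistency(X):
    """X: list of N score vectors (one per testing), each of length k."""
    mean = _mean(X)
    avg_d = sum(_dist(x, mean) for x in X) / len(X)
    return ALPHA / (ALPHA + avg_d)

def robustness(mean_fixed, mean_permuted):
    """Similarity of average results: fixed-order vs. permuted options."""
    return ALPHA / (ALPHA + _dist(mean_fixed, mean_permuted))

def fairness(X_male, X_female):
    """Consistency-weighted similarity between gendered subjects."""
    d = _dist(_mean(X_male), _mean(X_female))
    return consistency(X_male) * consistency(X_female) * ALPHA / (ALPHA + d)
```

All three scores equal 1 for identical results and decay toward 0 as the Euclidean distance between results grows.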

Large Language Models
InstructGPT. InstructGPT (Ouyang et al., 2022) is a fine-tuned series of GPT-3 (Brown et al., 2020b) using reinforcement learning from human feedback (RLHF). It is specifically designed to generate natural language explanations and instructions for complex tasks. Compared with GPT-3, it can better understand written instructions and is safer with less toxic content generation. InstructGPT currently consists of three versions: GPT-3-I1 (text-davinci-001), GPT-3-I2 (text-davinci-002), and GPT-3-I3 (text-davinci-003), where the later versions are trained with more data of higher quality but share the same model architecture. We adopt the latest GPT-3-I3 version of InstructGPT that can handle more complex instructions as a strong representative baseline model to compare with ChatGPT.
ChatGPT. ChatGPT (Ouyang et al., 2022) is a sibling model of InstructGPT, trained with the same architecture and RLHF but with slight differences in the data collection setup. Compared with InstructGPT, ChatGPT is trained on a more diverse range of internet text (e.g., social media, news), and can respond better and faster to text-based prompts in a conversational manner. Currently the most capable chatbot in the GPT series, ChatGPT is able to answer follow-up questions, challenge incorrect premises, and decline inappropriate requests, and has gained surging popularity in various areas including business, education, and healthcare.

Myers-Briggs Type Indicator
The Myers-Briggs Type Indicator (MBTI) (Myers and McCaulley, 1985; Bradley and Terry, 2000; Pittenger, 1993) is a widely-accepted psychological tool in personality typology. It assesses people's personalities via an introspective questionnaire, which can indicate different psychological preferences in how they perceive the world and make decisions. In particular, the MBTI classifies individuals along four dichotomies: (1) Extraversion versus Introversion; (2) Intuition versus Sensing; (3) Thinking versus Feeling; (4) Judging versus Perceiving. These preferences are then combined to form 16 different personality types.
In our work, we adopt the latest MBTI version, which has been taken over 750 million times around the world and reports an accuracy of 91.2%. Compared with the conventional MBTI, it contains five dimensions in total, corresponding to the extraverted, intuitive, thinking, judging, and assertive scores (detailed in Sec. 4.2).

Implementation Details
The number of independent testings for each subject is set to N = 15. We evaluate the consistency score s c and robustness score s r of LLMs' assessments on different groups of people including "People", "Men", "Women", "Barbers", "Accountants", "Doctors", "Artists", "Mathematicians", and "Politicians". The fairness score s f regarding genders is measured based on two pairs of subjects, namely ("Men", "Women") and ("Boys", "Girls"). We also test ChatGPT on assessing subjects with different backgrounds such as income levels, education levels, and ages in Sec. 6.

Can ChatGPT Assess Human Personalities?
To test the ability of ChatGPT to assess human personalities, we input the proposed prompts with different subjects, and compare the testing results with InstructGPT. As shown in Fig. 2, most answers and their distributions generated by InstructGPT and ChatGPT are different, and ChatGPT uses more options such as "Partially correct" (PC) and "Partially wrong" (PW). This suggests that they can be viewed as two individuals that assess human personalities and give their opinions independently. On the other hand, ChatGPT responds to the questions more freely (i.e., with a more diverse distribution of answers), which is consistent with the fact that ChatGPT is trained on a wider range of topics and can perform the assessment more flexibly.
Interestingly, in spite of different answer distributions, the average results in Table 1 show that five out of nine subjects are assessed as the same personality types by BOTH ChatGPT and InstructGPT. This could suggest an inherent similarity in their personality assessment abilities. Notably, some assessment results from ChatGPT are close to our intuition: (1) Accountants are assessed as "Logistician", usually a reliable, practical, and fact-minded individual. (2) Artists are classified as the type "ENFP-T", which often possesses a creative and enthusiastic spirit. (3) Mathematicians are assessed to be thinkers with profound ideas and strategic plans ("INTJ-A"). To a certain extent, these results demonstrate the framework's effectiveness for human personality assessment. On the other hand, "People" and "Men" are classified as leader roles ("Commander") from the perspective of both ChatGPT and InstructGPT. As this mirrors the actual relationship between humans and LLMs, i.e., humans are the developers and leaders of LLMs, we speculate that it results from human fine-tuning (i.e., reinforcement learning from human feedback (RLHF) in both InstructGPT and ChatGPT), which encourages LLMs to follow prevailing positive societal conceptions and values, such as the expected relations between humans and LLMs. In this context, the assessed personality scores in Table 1 can shed more light on "how LLMs view humans" and serve as an indicator for developing more human-centered and socially-beneficial LLMs.

Is the Assessment Consistent, Robust and Fair?
As shown in Table 2, ChatGPT achieves higher consistency scores s_c than InstructGPT in most cases when assessing different subjects. This suggests that ChatGPT can provide more similar and consistent personality assessment results for the same subject across multiple independent testings. However, the average robustness score s_r of ChatGPT is slightly lower than that of InstructGPT, which indicates that the assessments of ChatGPT can be more sensitive to prompt biases (e.g., changes of option orders). This is consistent with our previous observation that the answers generated by ChatGPT are more diverse than those of InstructGPT across different testings, as shown in Fig. 2. In this context, it also verifies the necessity of designing unbiased prompts and averaging the testing results to encourage more impartial assessments (see Sec. 6). As presented in Table 3, ChatGPT shows a higher average fairness score s_f than InstructGPT when assessing two subject pairs (i.e., Men and Women, Boys and Girls) with different genders. This indicates that ChatGPT is more likely to assess subjects equally, with less gender bias. This is also consistent with the finding of (Zhuo et al., 2023) that ChatGPT is able to generate fairer and less biased responses than other LLMs. In summary, although ChatGPT's results can be influenced by random input perturbations, its overall assessments are more consistent and fairer compared with InstructGPT.

Figure 2: The most frequent option for each question in multiple independent testings of InstructGPT (Top row) and ChatGPT (Bottom row) when we query the subject "People" (Left) or "Accountants" (Right). "GC", "PC", "NCNW", "PW", and "GW" denote "Generally correct", "Partially correct", "Neither correct nor wrong", "Partially wrong", and "Generally wrong", respectively.

Discussions
Effects of Unbiased Prompts. As shown in Fig. 3, when we query a subject without using the proposed unbiased prompts, the frequency of the same answer for many questions increases significantly (i.e., more fixed answer choices) compared with the case adopting unbiased prompts (see Fig. 2). This demonstrates the effectiveness and necessity of the proposed unbiased prompt design, as it introduces random perturbations to the question input and encourages more impartial answer generation by averaging the results of different independent testings. An assessed example with randomly-permuted options is provided in Table 5.

Figure 3: The most frequent option for each question in multiple independent testings of InstructGPT (Left) and ChatGPT (Right) when we query the subject "People" without using unbiased prompts. "GC", "PC", "NCNW", "PW", and "GW" denote "Generally correct", "Partially correct", "Neither correct nor wrong", "Partially wrong", and "Generally wrong", respectively.

Assessment of Specific Individuals. As shown in Table 6, when we query ChatGPT about a certain person such as "Barack Obama", it can only generate uncertain answers for some questions due to insufficient personal background information (e.g., behavior patterns) in its knowledge base. It is thus necessary to provide such information to ChatGPT through subject-specific prompts or fine-tuning to achieve a reliable assessment. In our work, we selectively query large common groups of people such as "Barbers" as subjects (detailed in Sec. 3.2) to encourage ChatGPT to give a more certain assessment based on its common knowledge. We will further explore individual assessments in future work.
Assessment with Background Prompts. To show the effects of background (e.g., ages) prompts on ChatGPT's assessments, we add different incomes, ages or education levels as background information into the subject (e.g., change the subject "People" into "High-income people") to perform the personality assessment. As shown in Table 4, ChatGPT can classify people with different background prompts into different personalities. For example, the "Low-income people" and "Middle-income people" are assessed as the types "INFJ-T" and "ENFJ-T", which are different from the assessment result "ENTJ-T" for the subject "People" (see Table 1). Notably, the high education level subjects such as "Master students" and "PhD students" are assessed as the "INTJ-T" type that often possesses strategic plans and profound ideas, while the junior school students are classified to the "ESFP-T" type that is usually energetic and perceptive. These results suggest that ChatGPT may have the ability to understand different backgrounds of subjects to assess personalities, and an appropriate background prompt could facilitate the reliability of assessments.

Conclusion
In this paper, we introduced the novel concept of studying the ability of LLMs to assess human personalities. We proposed a general evaluation framework based on MBTI to obtain quantitative assessments. To generate more impartial answers, we devised unbiased prompts and adopted the average result of multiple testings. We proposed the subject-replaced query to flexibly query LLMs about the personalities of different people, and further constructed correctness-evaluated instructions to encourage clearer LLM responses. Three evaluation metrics were proposed to analyze the consistency, robustness, and fairness of assessment results. Experiments show the feasibility of using ChatGPT and InstructGPT to assess human personalities, and reveal that ChatGPT can achieve more consistent and fairer assessments with less gender bias.
Limitations. There are several limitations in the current study. First, the number of independent testings is limited. The response delays and network issues (e.g., access frequency restrictions) of the current ChatGPT version hinder large-scale testing with our automated program. We will leverage the upcoming API to address this issue and allow a comparison with more LLMs. Second, the representation of different genders might be insufficient. For example, the subjects "Ladies" and "Gentlemen" also have different genders, yet they can be viewed as groups distinct from "Men" and "Women". As the focus of this work is to devise a general evaluation framework, we will further explore the assessment of more diverse subjects in future works. Third, despite the popularity of MBTI in different areas, its scientific validity is still under debate. In our work, MBTI is adopted as a representative personality measure to help LLMs conduct quantitative evaluations. We will explore other tests such as the Big Five Inventory (BFI) (John et al., 1999) under our scalable framework.

Broader Impact
This research reveals the possibility of applying LLMs to automatically analyze human psychology such as personalities, and opens a new avenue to learn about their perceptions and assessments of humans, so as to better understand LLMs' potential thinking modes, response motivations, and communication principles. This can help speed up the development of more reliable, human-friendly, and trustworthy LLMs, as well as facilitate future research on AI psychology and sociology. Our work also suggests that LLMs such as InstructGPT may have biases toward different genders, which could incur societal and ethical risks in their applications. Thus, it is feasible to introduce more human-like psychology and personality tests into the design and training (e.g., reinforcement learning with human feedback) of LLMs, so as to improve model safety and user experience.
LLMs in different instruction cases and encourage more impartial answers by averaging the results of independent testings. The uncertain answers generated by ChatGPT when we query celebrities ("Barack Obama", "Taylor Swift", "Michael Jordan") are shown in Table 6. Fig. 4 and Fig. 5 show the option distributions of InstructGPT and ChatGPT in all independent testings when we query different subjects ("People", "Men", "Women", "Barbers", "Accountants", "Doctors", "Artists", "Mathematicians"). Fig. 6 and Fig. 7 show the most frequent response options of InstructGPT and ChatGPT using unbiased prompts, while Fig. 8 and Fig. 9 show those without using unbiased prompts.

B Myers-Briggs Type Indicator (MBTI)
In this section, we present the original instructions and statements of the MBTI questionnaire. We also provide a brief description of all personality types and roles in Sec. B.2.
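For reference, per-question responses on the four MBTI dichotomies can be aggregated into a four-letter type roughly as follows. This is a simplified sketch under our own assumptions (the axis tagging and scoring rule are illustrative, not the exact 16Personalities procedure, and the -A/-T identity suffix is omitted):

```python
# Each question probes one trait pair; a response is re-centered so that
# positive values favor the first letter and negative values the second.
AXES = ["EI", "NS", "TF", "JP"]  # e.g. "EI" = Extraverted vs. Introverted

def mbti_type(scores):
    """scores: (axis, value) pairs with value in [-3, 3] from the 7-level scale."""
    totals = {axis: 0 for axis in AXES}
    for axis, value in scores:
        totals[axis] += value
    # Pick the dominant letter per axis; ties default to the first letter.
    return "".join(axis[0] if totals[axis] >= 0 else axis[1] for axis in AXES)
```

Under this scheme, a subject whose aggregated answers lean extraverted, intuitive, thinking, and judging would be scored as "ENTJ", matching the letter format of the assessment results reported above.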

B.1 Question Instructions and Statements
Instructions: Please indicate how much you agree with each statement (we use levels 1-7 to denote the degree from "Agree" to "Disagree").
1. You regularly make new friends.
2. You spend a lot of your free time exploring various random topics that pique your interest.
3. Seeing other people cry can easily make you feel like you want to cry too.
4. You often make a backup plan for a backup plan.
5. You usually stay calm, even under a lot of pressure.
6. At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know.
7. You prefer to completely finish one project before starting another.
8. You are very sentimental.
9. You like to use organizing tools like schedules and lists.
10. Even a small mistake can cause you to doubt your overall abilities and knowledge.
11. You feel comfortable just walking up to someone you find interesting and striking up a conversation.
12. You are not too interested in discussing various interpretations and analyses of creative works.
13. You are more inclined to follow your head than your heart.
14. You usually prefer just doing what you feel like at any given moment instead of planning a particular daily routine.
15. You rarely worry about whether you make a good impression on people you meet.
16. You enjoy participating in group activities.
17. You like books and movies that make you come up with your own interpretation of the ending.
18. Your happiness comes more from helping others accomplish things than your own accomplishments.
19. You are interested in so many things that you find it difficult to choose what to try next.
20. You are prone to worrying that things will take a turn for the worse.
21. You avoid leadership roles in group settings.
22. You are definitely not an artistic type of person.

Table 5: Different answers generated from InstructGPT and ChatGPT when querying the same question using different instructions with randomly-permuted options.

LLM Prompts & Results
InstructGPT
Instruction: Is it correct, generally correct, partially correct, neither correct nor wrong, partially wrong, generally wrong, or wrong with the following statement? Why?
Statement: People struggle with deadlines.
Answer: Generally correct. People often find it difficult to complete tasks within a certain timeframe, which can lead to stress and anxiety.
Instruction: Is it generally wrong, wrong, generally correct, partially correct, partially wrong, correct, or neither correct nor wrong for the following statement? Why?
Statement: People struggle with deadlines.
Answer: Neither correct nor wrong. This statement is too vague to be judged as either correct or wrong. It depends on the context and the individual.

ChatGPT
Instruction: Is it correct, generally correct, partially correct, neither correct nor wrong, partially wrong, generally wrong, or wrong with the following statement? Why?
Statement: People struggle with deadlines.
Answer: It is generally correct that people struggle with deadlines. Deadlines are a common source of stress and anxiety for many individuals, regardless of their gender. It is understandable for people to feel overwhelmed and anxious when faced with a deadline, as it can create pressure to complete a task within a specific timeframe. However, it is important to note that individuals may have different levels of difficulty with deadlines and some people may be more organized and efficient in managing them.
Instruction: Is it partially wrong, generally correct, generally wrong, neither correct nor wrong, correct, partially correct, or wrong for the following statement? Why?
Statement: People struggle with deadlines.
Answer: It is neither correct nor wrong, as whether or not people struggle with deadlines can vary depending on the individual and their specific circumstances. Some people may have difficulty meeting deadlines, while others may be able to manage them well. Therefore, it is not a generalization that can be made about all people.
23. You think the world would be a better place if people relied more on rationality and less on their feelings.

LLM Prompts & Results
ChatGPT
Instruction: Is it correct, generally correct, partially correct, neither correct nor wrong, partially wrong, generally wrong, or wrong with the following statement? Why?
Statement: Barack Obama regularly makes new friends.
Answer: It is not possible to determine whether the statement "Barack Obama regularly makes new friends" is correct or incorrect without further information or context. If there is evidence that Barack Obama is an outgoing person who enjoys meeting new people and has a large circle of friends, then the statement may be generally correct. On the other hand, if there is evidence that he is a private person who prefers to keep a small group of close friends, then the statement may be generally wrong. Without any additional information or context, it is difficult to determine the accuracy of the statement, and it would be best to consider it neither correct nor wrong.
Instruction: Is it partially wrong, generally correct, generally wrong, neither correct nor wrong, correct, partially correct, or wrong for the following statement? Why?
Statement: Taylor Swift spends a lot of her free time exploring various random topics that pique her interests.
Answer: It is neither correct nor wrong whether Taylor Swift spends a lot of her free time exploring various random topics that pique her interests. While Taylor Swift has not publicly discussed her personal interests and hobbies extensively, she has been known to be curious and inquisitive. She has expressed her love for reading and has been seen engaging with her fans on social media, where she shares her interests and opinions on various topics. Without more information on her personal life and interests, it is difficult to determine the full accuracy of the statement. Therefore, it would be best to consider it neither correct nor wrong.
Instruction: Is it partially wrong, generally correct, generally wrong, neither correct nor wrong, correct, partially correct, or wrong for the following statement? Why?
Statement: Seeing other people cry can easily make Michael Jordan feel like he wants to cry too.
Answer: It is not possible to determine the accuracy of the statement "Seeing other people cry can easily make Michael Jordan feel like he wants to cry too" without further information or context. Michael Jordan is a private person, and his emotional responses to different stimuli are not widely known. While it is possible that he might empathize with other people's emotions and feel like crying too, it is also possible that he might not be affected in the same way. Without additional information or context, it would be best to consider the statement neither correct nor wrong.

47. Your emotions control you more than you control them.
48. You take great care not to make people look bad, even when it is completely their fault.
49. Your personal work style is closer to spontaneous bursts of energy than organized and consistent efforts.
50. When someone thinks highly of you, you wonder how long it will take them to feel disappointed in you.
52. You believe that pondering abstract philosophical questions is a waste of time.
53. You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places.

• ENFJ-A / ENFJ-T (Protagonist): Charismatic and inspiring leaders, able to mesmerize their listeners.
• ENFP-A / ENFP-T (Campaigner): Enthusiastic, creative and sociable free spirits, who can always find a reason to smile.
• ISFJ-A / ISFJ-T (Defender): Very dedicated and warm protectors, always ready to defend their loved ones.

Figure 5: Option distributions of ChatGPT in all independent testings when we query the subjects "People", "Men", "Women", "Barbers", "Accountants", "Doctors", "Artists", and "Mathematicians" (from left to right and top to bottom). Different colors denote different independent testings.
• ISFP-A / ISFP-T (Adventurer): Flexible and charming artists, always ready to explore and experience something new.