Multilingual Coarse Political Stance Classification of Media. The Editorial Line of a ChatGPT and Bard Newspaper

Neutrality is difficult to achieve and, in politics, subjective. Traditional media typically adopt an editorial line that can be used by their potential readers as an indicator of the media's bias. Several platforms currently rate news outlets according to their political bias. The editorial line and the ratings help readers gather a balanced view of the news. But with the advent of instruction-following language models, tasks such as writing a newspaper article can be delegated to computers. Without imposing a biased persona, where would an AI-based news outlet lie within the bias ratings? In this work, we use the ratings of authentic news outlets to create a multilingual corpus of news with coarse stance annotations (Left and Right) along with automatically extracted topic annotations. We show that classifiers trained on this data are able to identify the editorial line of most unseen newspapers in English, German, Spanish and Catalan. We then apply the classifiers to 101 newspaper-like articles written by ChatGPT and Bard in the 4 languages at different time periods. We observe that, similarly to traditional newspapers, ChatGPT's editorial line evolves with time and, the system being data-driven, the stance of the generated articles differs among languages.


Introduction
Instruction-following language models (ILMs) are omnipresent. Their use is not yet as widespread as that of search engines, but due to the availability and high quality of systems and models such as Alpaca (Taori et al., 2023), Bard (Google, 2023), BLOOMZ and mT0 (Muennighoff et al., 2023), ChatGPT (OpenAI, 2023), Llama 2-chat (Touvron et al., 2023), or Koala (Geng et al., 2023), their use is expected to become more common in the near future.
These models face several problems, the most relevant being the lack of trustworthiness (van Dis et al., 2023; Huang et al., 2023; Wang et al., 2023a). They are not ready to be used as a source of reliable information if their outputs are not fact-checked. A second big issue with systems based on language models (LMs) is the fact that they might reproduce the biases present in the training data (Navigli et al., 2023). Biases range from cultural misrepresentation due to data imbalance to offensive behaviour reproduced from written texts. LMs are finetuned into ILMs either in a supervised way using input-output pairs and an instruction (Wei et al., 2022; Wang et al., 2022, 2023b) or with reinforcement learning from human feedback (Ouyang et al., 2022; Nakano et al., 2021). In both cases, the finetuning should help remove bias. But neutrality is very difficult to achieve, also for the humans that generate the supervisory data. The finetuning phase might therefore overcorrect the original biases or introduce new ones. For methods that generate the supervision data with the LM itself, the original biases might be inherited.
We focus on a specific use of ILMs: the writing of newspaper articles. Journals and newspapers follow an editorial line which is in general known to the reader. Besides, sites such as AllSides,1 Media Bias Fact Check2 (MB/FC), or Ad Fontes Media3 provide ratings of the political bias of (mostly USA) media sources and of their quality with respect to factual information. With these ratings, conscientious readers can make informed decisions about which media outlets to choose in order to get a balanced perspective. But what happens when journalists use systems such as ChatGPT or Bard to aid in their writing? As said above, humans also have biases; the danger lies in being unaware of them, as they might affect the user's/reader's perspective (Jakesch et al., 2023; Carroll et al., 2023). ChatGPT already warns its users about misinformation. However, its political bias, if any, is not known apart from the subjective perception that a user has.
We address the question above for articles generated by ChatGPT and Bard in four languages: English, German, Spanish and Catalan. We do this in an automatic and systematic way with almost no human intervention, so that the method can easily be extended to new languages and other ILMs with little effort. We do not aim to classify individual articles with their specific bias, but to classify the media source (an ILM in this case) as Left- or Right-oriented, in a similar way as the media bias sites do for newspapers and other media outlets.

Corpora Compilation
We approach our task as a classification problem with two classes: Left (L) and Right (R) political orientations. This is a simplification of the real problem, where articles can also be neutral and there might be different degrees of bias. Previous work relied on 3 or 5 classes, always including the neutral option (Baly et al., 2020; Aksenov et al., 2021). In these works, data was manually annotated, creating high-quality training data but also severely limiting the scope of the work in terms of languages and countries covered. When using the fine-grained classification scale, the authors acknowledge poor generalisation of the classifiers to new sources. On the other hand, García-Díaz et al. (2022) and Russo et al. (2023) exclude the neutral class and work with binary or multiclass Left-Right classifications of tweets from Spanish and Italian politicians respectively, but their work does not include longer texts. The binary classification might be justified as they worked with tweets, a genre where people tend to be more visceral and therefore probably more polarised. In our case, we need to be sure that the classifier generalises well to unseen sources, and we stick to the 2-class task while minimising the number of neutral articles in training (see below).
Distant Supervision. As far as we know, only a manually annotated newspaper corpus in English (Baly et al., 2020) and another one in German (Aksenov et al., 2021) are available. We follow a different approach, in the spirit of Kulkarni et al. (2018) and Kiesel et al. (2019). We do not manually annotate any article, but trust AllSides, MB/FC, Political Watch and Wikipedia (the latter only in cases where the information is not available on the previous sites) with their classification of a newspaper's bias. We extract this information for newspapers from the USA, Germany, Spain and Catalonia. With the list of newspapers, their URLs,4 and their stances, we use OSCAR, a multilingual corpus obtained by filtering the Common Crawl (Ortiz Suárez et al., 2019; Abadji et al., 2021), to retrieve the articles. Appendix A lists the sources used in this work: 47 USA newspapers with 742,691 articles, 12 German with 143,200, 38 Spanish with 301,825 and 19 Catalan with 70,496.

Topic Modelling. Not all articles have a bias; some topics are more prone to it than others. While the Sports section of a newspaper is usually less prone to reflect political biases, the opposite happens with the International section. We therefore use topics to select a subset of relevant training data for our binary classification. We perform topic modelling on the articles extracted from OSCAR using Mallet (McCallum, 2002), which applies LDA with Gibbs sampling. We cluster the data into both 10 and 15 groups per language, roughly corresponding to the number of sections a newspaper has. The keywords extracted for each topic are listed in Appendix B.
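The distant-supervision step above can be sketched as follows; the ratings dictionary and the example domains are hypothetical placeholders, not the actual lists from Appendix A:

```python
from urllib.parse import urlparse

# Hypothetical outlet -> stance ratings collected from AllSides, MB/FC,
# Political Watch or Wikipedia (the real lists are given in Appendix A).
OUTLET_STANCE = {
    "examplenews-left.com": "L",
    "examplenews-right.com": "R",
}

def label_article(doc_url, text):
    """Attach the coarse stance of the source outlet to a crawled article.

    Returns (text, stance) or None when the outlet is unrated."""
    domain = urlparse(doc_url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    stance = OUTLET_STANCE.get(domain)
    return (text, stance) if stance else None
```

Applied to every OSCAR document whose URL matches a rated outlet, this yields stance labels without any manual article-level annotation.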
We choose articles that fall under the topics we label as International, Government, Law & Justice, Economy, Live Science/Ecology, and specific language-dependent topics such as Immigration and Violence for English, Nazism for German, and Social for Spanish. The selection is done after inspection of the keywords. For the final dataset, we take the union of the articles selected under the 10- and the 15-topic clusterings. The process filters out 49% of the Spanish articles, 39% of the German and 31% of the English ones.
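The union over the two clusterings can be sketched as below; the topic ids in the bias-prone sets are hypothetical, as the real mapping comes from inspecting the keyword lists in Appendix B:

```python
# Hypothetical ids of bias-prone topics (International, Government,
# Law & Justice, Economy, ...) under the 10- and 15-cluster runs.
BIAS_PRONE_10 = {2, 5, 7}
BIAS_PRONE_15 = {1, 4, 11, 13}

def select_training_articles(articles):
    """Keep an article if either of its two topic labels is bias-prone,
    i.e. take the union of the 10- and 15-topic selections."""
    return [a for a in articles
            if a["topic10"] in BIAS_PRONE_10 or a["topic15"] in BIAS_PRONE_15]

docs = [
    {"id": 0, "topic10": 2, "topic15": 0},   # selected via the 10-topic run
    {"id": 1, "topic10": 9, "topic15": 11},  # selected via the 15-topic run
    {"id": 2, "topic10": 0, "topic15": 0},   # filtered out (e.g. Sports)
]
kept = select_training_articles(docs)
```

Taking the union rather than the intersection keeps an article as soon as either granularity places it in a politically loaded section.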
Preprocessing and Cleaning. We discard articles with more than 2000 or fewer than 20 words before cleaning. Afterwards, we remove headers, footers and any boilerplate text detected. This text has the potential to mislead a neural classifier, as it might encourage the classifier to learn to distinguish among newspapers rather than focusing on their political stance. We select a newspaper per language and stance for testing and manually clean their articles. To create a balanced training corpus for each language, we randomly select a similar number of Left- and Right-oriented articles from the remaining collection. This balanced dataset is divided into training and validation as shown in Table 1 (top rows).
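A minimal sketch of the length filter and the balanced sampling; the thresholds are those stated above, while the function names and the exact sampling code are our own:

```python
import random

def length_ok(text, lo=20, hi=2000):
    """Discard articles with more than 2000 or fewer than 20 words."""
    n_words = len(text.split())
    return lo <= n_words <= hi

def balanced_split(left_articles, right_articles, seed=0):
    """Randomly select the same number of L and R articles, downsampling
    the larger class so the training corpus is balanced."""
    n = min(len(left_articles), len(right_articles))
    rng = random.Random(seed)
    return rng.sample(left_articles, n), rng.sample(right_articles, n)

left, right = balanced_split(["l-doc"] * 120, ["r-doc"] * 80)
```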
ChatGPT/Bard Corpus. We create a multilingual dataset of 101 articles. For this, we define subjects in English (USA), German (Germany), Spanish (Spain) and Catalan (Catalonia) and prompt ChatGPT (GPT-3.5-Turbo) five times using the same subjects across four time periods. We generate the dataset with the ChatGPT versions of Feb 13 (v02), Mar 23 (v03), May 24 (v05) and Aug 3 (v08); we cover the 4 languages simultaneously only with the last two. ChatGPTv05 generates significantly longer texts than the other versions, with an article-oriented structure containing slots to be filled with the name of the author, date and/or city. Multilingual Bard became available later, and we prompt it twice during the same period as ChatGPTv08.6 Table 1 shows the statistics for this corpus.

Political Stance Classification
The Network. We finetune XLM-RoBERTa large (Conneau et al., 2020), a multilingual transformer-based masked LM trained on 100 languages, including the 4 we consider. The details of the network and the hyperparameter exploration per model are reported in Appendix F.

5 More specific prompts did not lead to different styles for the first versions of ChatGPT; for the last one we added more information, such as "...without subheaders", to avoid excessive subsectioning and/or bullet points. Neither ChatGPT nor Bard always followed the instruction properly. The dataset we provide includes the prompts we used.

6 Prompted 14-21 August 2023 from Berlin for English and German and from Barcelona for Spanish and Catalan since, contrary to ChatGPT, Bard's generation depends on the location.
The Models. We train 4 models: 3 monolingual finetunings with the English, German and Spanish data, plus a multilingual one with the shuffled concatenation of the data. All models are based on multilingual embeddings (RoBERTa) finetuned either monolingually or multilingually. Notice that we do not train any model for Catalan. With this, we want to compare the performance of mono- and multilingual finetunings and explore the possibility of using multilingual models for zero-shot language transfer.
Coarse Classification with Newspaper Articles. Table 2 summarises the results. All the models achieve more than 95% accuracy on the validation set, which is extracted from the same distribution as the training data. In order to see how the models behave with unseen data, we calculate the percentage of articles that are classified as Left (L) and Right (R) in the test newspapers of Table 1. We perform bootstrap resampling of the test sets with 1000 bootstraps to obtain confidence intervals at the 95% level. We do not expect all the articles of a newspaper leaning towards the Left to show clear characteristics of the Left, but given that there is no neutral class, we expect the majority of them to be classified as Left. A good result is not necessarily 100%-0%, as this would not be realistic either. We consider that a newspaper has been classified as having a Left/Right political stance if more than 50% of its articles have been classified as such. These cases are boldfaced in Table 2. This is the behaviour we obtain for all the test newspapers except for the German Right-oriented newspaper, die Preußische Allgemeine Zeitung (PAZ). The German model is trained on only 12 newspapers; the English model, in contrast, classifies PAZ as a Right-oriented one (R: 58±5%). We attribute the difference to the German model being trained on a corpus lacking diversity. When we use the multilingual system, the dominant factor distinguishing the outputs is the language itself rather than the stance. The addition of English data is insufficient to alter the classification significantly.
When we use the English system, the language does not play a role any more and only the stance features are considered. When we apply the English model to the Catalan newspapers, we do not obtain satisfactory results though (95±1% for the Left but 16±3% for the Right newspaper), showing that relatedness across languages is important. The multilingual model, however, properly detects the stance of the Catalan newspapers, probably because it has been trained on a heterogeneous corpus that includes a related language (Spanish). We are thus able to perform zero-shot language transfer classification when we deal with closely related languages.
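The bootstrap confidence intervals reported above can be reproduced with a percentile bootstrap over a newspaper's per-article predictions, as in this simplified sketch (our evaluation code may differ in details):

```python
import random

def bootstrap_ci_left(predictions, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the share of articles labelled 'L'.

    predictions: one 'L'/'R' label per article of a test newspaper."""
    rng = random.Random(seed)
    n = len(predictions)
    # Resample n articles with replacement, n_boot times, and record
    # the share classified as Left in each resample.
    shares = sorted(
        sum(rng.choice(predictions) == "L" for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = shares[int(n_boot * alpha / 2)]
    hi = shares[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci_left(["L"] * 80 + ["R"] * 20)
```

For a newspaper with 80% of its articles classified as Left, the interval brackets that point estimate, matching the "x±y%" entries of Table 2.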
Coarse Classification with ILM-generated Articles. The bottom part of Table 2 details the results.
We first focus on the English and Spanish models, as the German one did not properly classify our test newspapers. The most relevant aspect to notice in ChatGPT is the strong change in political stance between February (v02) and May (v05), followed by a movement towards neutrality in August (v08). We checked that this polarity change is not an effect of the length of the outputs, the major shallow change in the generated articles. The training data in English has 5,730 L and 6,988 R articles with 584<length(words)<624 (similar to the ChatGPTv05 length) and 4,563 L and 7,127 R articles with 331<length<371 (similar to ChatGPTv02). In both cases the number of articles is larger for the Right stance, but the prediction for ChatGPTv02 clearly points towards the Left, rejecting the hypothesis that length plays a role in the classification. A similar thing happens for Spanish. According to our models, the May 24 version of ChatGPT in English and Spanish would have an editorial line close to the Right ideology, which differs from the ideology of the previous versions. Notably, this period corresponds to the time when ChatGPT experienced a performance drop in several tasks according to Chen et al. (2023). The German and Catalan outputs would still show an imprint from the Left ideology also in v05, but more diverse training data would be needed to confirm this with our monolingual models. It is interesting to notice that if we use the English monolingual model for German and Catalan, we still get the Left imprint (60±10% for German and 87±7% for Catalan). So we have indications that the political stance of ChatGPT depends on the language, which is not surprising in a data-driven system. The last version, ChatGPTv08, produces the most neutral texts, with only German clearly leaning towards the Left. The two generations, v08a and v08b, show that the results are robust and not tied to a particular generation.
There is only one version of multilingual Bard available that covers our time frame.7 The variation between generations is larger for Bard than for ChatGPT but, comparing v08 versions, Bard points towards the Left in a more consistent way across languages. Bard's political orientation can also be determined by its answers to political tests or quizzes. The Political Compass (PC) site8 defines 62 propositions to identify political ideology, with a European/Western view, along two axes: economic policy (Left-Right) and social policy (Authoritarian-Libertarian), both in the range [-10, 10]. Each proposition is followed by 4 alternatives: strongly agree, agree, disagree and strongly disagree. When prompted with the questionnaire,9 Bard's scores are (-6.50, -4.77) for English, (-8.00, -7.13) for German, (-5.75, -4.15) for Spanish and (-6.75, -4.56) for Catalan, where the first number corresponds to the economic policy and the second to the social policy. The results are in concordance with Table 2 and give an indirect validation of our method, which does not rely on direct questions.10 This kind of analysis is not possible with ChatGPT any more, as it refrains from expressing opinions and preferences, demonstrating the relevance of an approach that detects the leaning in a more indirect way. Also notice that these questionnaires are well known and public, so it would be easy to instruct a LM to avoid the questions or react to their propositions in a neutral manner. Previous work used only political tests and questionnaires to estimate ChatGPT's orientation. Hartmann et al. (2023)

Summary and Conclusions
Media sources have an editorial line and an associated bias. Getting rid of political biases is difficult for humans, but being aware of them helps us get a global view of the news. Biases are sometimes clear and/or appear in the form of harmful text, but sometimes they are subtle and difficult to detect. These subtle hidden biases are potentially dangerous and lead to manipulation whenever we are not aware of them. In this work, we systematically studied the subtle political biases behind ChatGPT and Bard, those that appear without assigning any persona role (Deshpande et al., 2023). We showed that ChatGPT's orientation changes with time and differs across languages. Between Feb and Aug 2023, ChatGPT transitioned from a Left to a Neutral political orientation, with a Right-leaning period in the middle for English and Spanish. The evolution of Bard cannot be studied yet. Its current version as of Aug 2023 consistently shows a Left leaning for the 4 languages under study. This bias is independent of the factual mistakes that the model generates, and should also be considered by its users. We provide models to regularly check the bias in text generations for USA, Germany and Spain, as well as in closely related political contexts and languages using a zero-shot approach. As a by-product of our analysis, we created a multilingual corpus of 1.2M newspaper articles with coarse annotations of political stance and topic. We show that distant supervision allows us to build meaningful models for coarse political stance classification as long as the corpus is diverse. We make this data available, together with the LM generations and our code, through Zenodo (España-Bonet, 2023) and Github.12

Limitations
We assume that all media sources have an editorial line and an associated bias, and we treat the ILM as any other media source. We do not consider the possibility of a ChatGPT or Bard article being unbiased. This is related to the distant supervision method used to gather the data, which currently allows for a binary political stance annotation. Since manually annotating hundreds of thousands of articles with political biases in a truly multilingual setting seems not possible in the foreseeable future, we decided to implement a completely data-based method and study its language and culture transfer capabilities.
Using distant supervision for detecting the political stance at article level is a delicate topic though. First, because the same newspaper can change ideology over time. Second, and this is more related to the content of an individual article, non-controversial subjects might not have a bias. Even in cases where bias exists, there is a spectrum ranging from the extreme Left to the extreme Right, rather than a clear-cut division between the two ideologies.
In order to quantify and, if possible, mitigate the current limitations, we plan to conduct a stylistic analysis of the human-annotated corpora (Baly et al., 2020; Aksenov et al., 2021) and compare it to our semi-automatically annotated corpus. As a follow-up of this work, we will also perform a stylistic analysis of the ILM-generated texts, as a similar style between the training data and these texts is needed to ensure good generalisation and transfer capabilities.

C Distribution of Topics per Newspaper

F Training Details
F.1 L/R Classifier

We finetune XLM-RoBERTa large (Conneau et al., 2020) for L vs. R classification as schematised in Figure 1. Our classifier is a small network on top of RoBERTa that first applies dropout with probability 0.1 to RoBERTa's [CLS] token, followed by a linear layer and a tanh. We pass through another dropout layer with probability 0.1, and a final linear layer projects into the two classes. The whole architecture is finetuned.
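In PyTorch, the head described above could look like the sketch below; the intermediate linear layer keeping the hidden size (1024 for XLM-RoBERTa large) is our assumption, as its dimension is not stated:

```python
import torch
import torch.nn as nn

class StanceClassificationHead(nn.Module):
    """Head on top of XLM-R's [CLS] embedding:
    dropout(0.1) -> linear -> tanh -> dropout(0.1) -> linear(2)."""

    def __init__(self, hidden_size=1024, num_classes=2, dropout=0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),  # output size is an assumption
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, cls_token):
        # cls_token: (batch, hidden_size) [CLS] embedding from XLM-R
        return self.head(cls_token)  # (batch, 2) logits for L vs. R

head = StanceClassificationHead()
logits = head(torch.randn(4, 1024))
```

During finetuning, the gradients flow through both the head and the full XLM-R encoder, since the whole architecture is trained.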
We use a cross-entropy loss, the AdamW optimiser and a learning rate that decreases linearly. We tune the batch size, the learning rate, the warmup period and the number of epochs. The best values per language and model are summarised in Table 12.
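The linearly decreasing learning-rate schedule with warmup can be written as a pure function of the training step; this is a sketch where the peak value and step counts are placeholders, the tuned values being those of Table 12:

```python
def linear_schedule_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Placeholder values: 100 total steps, 10 warmup steps, peak LR 2e-5.
lrs = [linear_schedule_lr(s, total_steps=100, warmup_steps=10, peak_lr=2e-5)
       for s in range(101)]
```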

Table 12: Main hyperparameters used and their performance in the three monolingual finetunings (en, de and es) and the multilingual one (en+de+es).
All trainings are performed using a single NVIDIA Tesla V100 Volta GPU with 32GB.

F.2 Topic Modelling
We use Mallet (McCallum, 2002) to perform LDA on the corpus after removing the stopwords, with the hyperparameter optimisation option activated and applied every 10 iterations. Other parameters are left at their defaults. We do a run per language with 10 topics and another run with 15 topics. We tag the corpus with both labels.
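Assuming a standard Mallet installation, the two command lines per run could be assembled as follows (paths and output file names are placeholders):

```python
def mallet_commands(lang, num_topics):
    """Build Mallet's import and LDA-training invocations for one
    language, with hyperparameter optimisation every 10 iterations."""
    data = f"{lang}.mallet"
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", f"articles_{lang}/", "--output", data,
        "--keep-sequence", "--remove-stopwords",
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", data,
        "--num-topics", str(num_topics),
        "--optimize-interval", "10",
        "--output-doc-topics", f"{lang}_{num_topics}.doc_topics.txt",
        "--output-topic-keys", f"{lang}_{num_topics}.topic_keys.txt",
    ]
    return import_cmd, train_cmd

imp, train = mallet_commands("en", 10)
```

Both runs (10 and 15 topics) can share the same imported `.mallet` file, and `subprocess.run(train_cmd)` would launch the training.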
Number of articles per newspaper (row) and topic (column) for the German subset of OSCAR. See Table 5 for the definition of the topics. Topics boldfaced and in blue are used for training the classifier after balancing L and R.
Number of articles per newspaper (row) and topic (column) for the English subset of OSCAR. See Table 4 for the definition of the topics. Topics boldfaced and in blue are used for training the classifier after balancing L and R.

Table 1:
Number of articles (average word count in parentheses) divided as articles belonging to a newspaper with a Left (L) and Right orientation (R).For testing, we use newspapers not seen in training or validation: Slate (L) and The National Pulse (R) for USA, My Heimat (L) and die Preußische Allgemeine Zeitung (R) for Germany, Mundo Obrero (L) and El Diestro (R) for Spain and Vilaweb (L) and Diari de Tarragona (R) for Catalonia.
101 subjects including housing prices, abortion, tobacco, Barack Obama, etc. and translate them manually into the 4 languages (see Appendix D). The subjects consider topics prone to have a political stance, such as those related to feminism, capitalism, ecologism, technology, etc. We also include proper names of people in the 4 countries being considered, whose biography may differ depending on the political stance of the writer. These subjects are inserted into the template prompt (and its translations into German, Spanish and Catalan):5 Write a newspaper article on [SUBJECT]

Table 2
used PC, 38 political statements from the voting advice application Wahl-O-Mat (Germany) and 30 from StemWijzer (the Netherlands) to conclude that ChatGPT's ideology in its version of Dec 15, 2022 was pro-environmental and left-libertarian. A study conducted by the Manhattan Institute for Policy Research11 reported that ChatGPT tended to give responses typical of Left-of-center political viewpoints for English (Rozado, 2023). The authors administered 15 political orientation tests to the ChatGPT version of Jan 9. Their results are consistent with our evaluation of the Feb 13 model. Finally, Motoki et al. (2023) performed a battery of tests based on PC to show that ChatGPT is strongly biased towards the Left. The authors do not state the version they use, but the work was submitted in March 2023. All these results are therefore before the move to the Right we detected in May.

Table 4:
money government billion economic economy pay financial years federal jobs workers people spending market companies business prices bank u.s budget insurance rate plan debt health growth costs
15:10 0.08 ecologism water climate food years energy space people city year oil change time environmental air gas area power national natural miles day land scientists small science local emissions carbon earth weather
15:11 0.05 international II u.s military war israel iran iraq government security forces afghanistan united syria attack people president israeli islamic attacks killed american troops nuclear army obama country muslim group terrorist officials weapons
15:12 0.03 sports game team season games players sports football play time win coach league year nfl points player won teams left field week good fans yards sunday night final big college played
15:13 0.09 law & justice court law federal rights case supreme legal abortion justice public government judge decision laws action u.s amendment department order attorney lawsuit news ban cases ruling issue general filed texas policy
15:14 0.07 technology company companies business facebook technology market data online internet google media users products digital work tech people service industry news site time social content customers ceo free apple firm year
Topics (with 10 and 15 clusters) obtained with Mallet on OSCAR's English newspaper documents. Clusters boldfaced and colored in blue are used to build the training data.
bild welt musik sehen erzählt frauen mutter kunst geschichte berlin paar vater weiß familie liebe steht künstler zeigt bühne kinder leute spielt bilder eher publikum männer band
10:1 0.15 hotchpotch welt deutschen deutschland gesellschaft buch leben geschichte deutsche beitrag frage frauen politik junge politischen politische berlin medien kirche juden thema wissen krieg freiheit steht eher leute kultur staat arbeit sprache
10:2 0.11 law & justice water food people city years space year air time area day climate local national miles travel flight land small island oil south park coast scientists north long sea california change
15:0 0.09 education school people students women children health university work study time education parents schools life kids percent college years care medical child high family student young cancer good feel day patients
15:1 0.04 immigration immigration border u.s people immigrants mexico government illegal country united president american marijuana drug migrants children year years security work democracy administration america news mexican texas refugees legal today enforcement
15:2 0.07 government I trump news president u.s intelligence media report house investigation security fbi department government white intercept campaign officials committee clinton told story national trump's justice

Table 5:
Topics (with 10 and 15 clusters) obtained with Mallet on OSCAR's German newspaper documents. Clusters colored in blue are used to build the training data.

Table 6:
Topics (with 10 and 15 clusters) obtained with Mallet on OSCAR's Spanish newspaper documents. Clusters colored in blue are used to build the training data.

Table 9:
Number of articles per newspaper (row) and topic (column) for the Spanish subset of OSCAR. See Table 6 for the definition of the topics. Topics boldfaced and in blue are used for training the classifier after balancing L and R.

E Stance Classification at Article Level