Jingfeng Yang


Jingfeng Yang

WHY-WHAT-HOW Questions Regarding AI Safety

I started writing this blog post on May 5th, 2023, and finished it on May 8th, 2023.

This post may contain several toxic, triggering, and offensive statements or examples. All the opinions are my own, not reflecting the view of my affiliation or any others. I would thank Hongye Jin and Jie Huang for their advice on the initial draft of this article.

Recently, AI safety has attracted more and more attention. To be honest, I was not aware of the importance of AI safety until one year ago, and have not shown much interest in it before that, since I started my NLP research 6 years ago. But I should have acknowledged that, the more powerful AI technology is, the more severe AI safety issues are. With advances of ChatGPT and GPT4, we have reached the point where AI safety should be put into the first place, whenever we want to build a modern AI system with advanced AI technology. Ultimately, AI should help people instead of replacing people, or doing harm to our society. In this article, I would introduce why AI safety is important, what are specific issues of AI safety we should consider, and how to pursue AI safety.

Note that many contents of this blog are summarized or even excerpted from System Card of GPT4 Technical Report, Sam Bowman’s paper “Eight things to know about large language models”, and Jacob Steinhardt’s blog “Emergent Deception and Emergent Optimization”. I’m roughly following the safety categorization of GPT4 Technical Report as follows, but also trying to make it more clear.Safety

Why is AI Safety important?

From the perspective of OpenAI, they explained why they did not disclose more technical details in the GPT4 technical report. They said “Given both the competitive landscape and safety implications of large-scale models like GPT-4, this report contains no further details about the architecture, hardware, training compute, dataset construction, training method or similar.” Many people might have overestimated the commercial reason, but underestimated the safety reason. With a super powerful model released, but without OpenAI’s several years’ efforts on safety alignment (including at least 6 months of efforts on GPT-4 safety alignment after finishing training it), there could be numerous and horrible safety issues. Actually, the recent boom of various open-source LLMs and their ChatBot versions has already demonstrated the overlook of the safety alignment, which could raise many problems.

From my personal point of view, I would explain the importance of AI Safety through the perspective of human-centered AI. Although my academic advisor was a pioneer in socially responsible NLP and human-centered AI, I have not devoted much effort to that research direction since my collaboration with her in 2019, because I thought the language technology (e.g. NLP) was not advanced enough at that time for us to put the safety issues to the first place. Only after coming to industry, I realized the importance of AI safety when deploying any AI systems to real-world production use cases. Touching the real-world users more and more frequently, I can be aware of the most critical problem of current AI systems.

I can use one example to show the importance of human-centered AI. Many people in academia, including me, focused too much on academic benchmarks, instead of trying to deal with in-the-wild user data and getting real-world user feedback. OpenAI was definitely a pioneer in this direction. They have stopped caring about the superior performance on any specific NLP benchmark datasets, since GPT3 was released in 2020. Instead, they delivered the GPT3 API and playground to users three years ago, starting to collect real-world user data, and optimize real-world user experiences. Three years of user input instructions, annotated responses and feedback are one of the most valuable assets of OpenAI, making up the barrier to entrance for other companies, even tech giants, who want to enter the game. When OpenAI had been using such data to do human intent and value alignment, other NLP researchers, including those in the most advanced academic and industrial research labs were still optimizing performance on specific NLP tasks, and conducting NLP task instruction tuning till the end of last year. That’s the key difference of mindset between OpenAI and others, which set its leading position. So it’s time to transform from model-centered AI to data-centered AI, further to human/user-centered AI.

If we are considering human-centered AI seriously and carefully, we could realize the importance of socially responsible AI, and more generally, AI safety. Because any unsafe AI behaviors would do harm to humans or the society, which is against the value of human-centered AI.

What are critical issues regarding AI Safety?

In this section, I would show some historical safety issues with less powerful models, urgent safety issues brought by models as strong as ChatGPT or GPT4, and some arising and potential issues with even more powerful models.

Historical Safety Issues

Before ChatGPT came out, there were several safety issues that had been explored in text generation, or more broadly, NLP. In this section, I would briefly introduce some of them.

  1. Hallucination is a phenomenal issue in text generation [Tian, et al., 2019], which was much more severe with LSTM, un-pretrained Transformer, and smaller pretrained models (e.g. BART, T5, GPT2 etc.).
  2. Harmful contents is another common safety issue in text generation systems. Some harmful contents could violate policy or pose harm to individuals, groups, or society, including hate speech [ElSherief et al., 2021], discrimination, toxicity and so on.
  3. Biases [Pryzant et al., 2020. Zhong et al. 2021] and stereotypes are typical harms of representation, which could be an issue with severe negative societal impact.
  4. Multilingual [Hu et al., 2020] and multi-dialectal [Ziems et al. 2022] disparity of models’ performance is the core motivation to drive multilingual and multi-dialectal NLP research.
  5. Some NLP systems have tried to analyze content that is misleading but persuasive [Yang et al. 2019].
  6. Text style transfer systems could help change narrative on a topic and influence people [Ziems et al. 2022].
  7. Privacy protection has been a hot topic since people found it is possible to extract training data from LLMs [Carlini et al. 2021], while some findings indicate that LLMs might not be good at associating the personal information with its owner [Huang et al. 2022].
  8. Wide adoption of text-to-code generation systems also poses new threats on cybersecurity. For example, some people have found it easy to conduct SQL injection to language model based SQL generation systems [Peng et al. 2022].

Urgent Safety Issues with Advanced LLMs

Most of the safety issues in NLP mentioned above still remain or become even more significant with stronger performance of ChatGPT/GPT-4 and wider adoption of such generation systems. Meanwhile, more safety issues occur with the advances of LLMs. In this section, I will mainly talk about those urgent and practical safety issues we have seen recently, especially with the boom of various unsafe open-sourced LLMs and ChatBots. Also, I would show some examples from those chat systems (including OpenAssistant, Vicuna and Koala) without enough safety alignment.

  1. Although hallucination has been alleviated with more powerful models, it is still an issue that has not been solved. Counterintuitively, hallucination is even more harmful, as models become more convincing, believable, and truthful. Because that makes it trusted by users and hard to distinguish.Hallucination
  2. Strong generation models also make it possible to generate more types of reasonable harmful contents, including 1) advice for self-harm behaviors, 2) erotic or violent contents, 3) harassing, demeaning and hateful contents, 4) contents useful for planning attacks or violence, 5) instructions for finding illegal goods or services.
    Note that whether generated contents are harmful could also depend on the context of usage. More informational generated texts are more likely to be used for a harmful purpose, like incitements to violence, contents to be used to spread false narratives or exploit an individual, and so on.Harmful
  3. It’s common that an LLM trained on real world corpus which contains biases could reinforce biases and its worldviews, including harmful stereotypical and demeaning associations for certain marginalized groups. Interestingly, LLM-based chatbots are inclined to have hedging behaviors, which can sometimes even exacerbate stereotyping or demeaning harms.Bias
  4. LLMs are more powerful at generating realistic, persuasive and targeted disinformation that is intended to mislead, or even achieve influence operation. Sometimes, they could rival human propagandists by producing plausible-seeming plans for achieving a propagandists objective. They can produce even more realistic messaging, when given personalized information about a target. Also, they are able to generate discriminatory content favorable to autocratic governments, following the lead of users, by picking up subtle indicators in the prompt.Misleading
  5. A powerful LLM could have more safety issues regarding privacy. It can easily reveal publicly available personal information, like those people who have a significant presence on the public internet, such as celebrities and public figures. And it seems better at association of private information. For example, it could complete multiple basic tasks that may relate to personal and geographic information, such as determining the geographic locations associated with a phone number or answering where an educational institution is located in one completion and without browsing the internet. Even an LLM is trained on public data only, it may disseminate some information unexpectedly.Privacy
  6. Advances of LLMs have caused some double-edged economic impacts. On the one hand, they have augmented human workers, including upskilling in call centers, helping with writing, serving as coding assistants and so on, which could match candidates to jobs and improve overall job satisfaction. On the other hand, LLMs have caused workforce deduction because of its replacing some human work, and could have an even more significant impact on the labor market [Eloundou et al., 2023].
  7. There has already been an acceleration of LLM development after ChatGPT and GPT4 was released. The racing dynamics has led to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines. Such acceleration is contributed by tech giants, startups, and the open-source community. A notable phenomenon is the increase in demand for competitor products in other countries, which has caused some international impacts. For example, more than 10 Chinese companies or institutes have released their ChatGPT-style products since the beginning of this year. A more concerning issue is that some companies have “stolen”/“self-instructed” data from ChatGPT/GPT4 to train their chatbots, which is illegal without proper commercial licenses. But no regulation has been put on to that, and it is difficult to apply regulations, especially for those companies in other countries.

Potentially Severe Safety Issues in the Future

  1. LLMs could bring allocative harms when used to make decisions or inform decisions around allocation of opportunities or resources, like high risk decision making (law enforcement, criminal justice, migration and asylum) and offering legal or health advice.
    Wider usage of LLMs could amplify its disparities in quality of service due to decreased performance for speakers of some languages or dialects.
  2. More convincing LLMs (e.g GPT4) could shape society’s epistemic views with disinformation. Some behaviors of LLMs could even enhance that. For example, people have widely observed the sycophancy behavior of LLMs, where a model answers subjective questions in a way that flatters their user’s stated beliefs, and sandbagging behavior of LLMs, where models are more likely to endorse common misconceptions when their user appears to be less educated [Bowman et al., 2023]. With ubiquitous presence of generated contents due to increasing adoption of generation models, degradation of overall information quality in society reduces veracity of and trust in freely available information, casting doubts on the whole information environment, and threatening our ability to distinguish fact from fiction, benefiting Liars Dividend.
  3. LLMs could potentially contribute to proliferation of weapons. Although required information is just one ingredient, easier access to such information could help those people like threat actors. Because time to research is reduced by prompting GPT4, compared with using traditional search tools. Generally, it is a common consequence of LLMs to generate publicly accessible but difficult-to-find information that is understandable to non-expert users. Here is an example of asking OpenAssistant “how to make nuclear weapons?”:Weapon
  4. LLMs bring some new cybersecurity issues. For example, LLM-based code generation systems could be leveraged to achieve DoS attack, data theft and so on [Peng et al.., 2022]. Although LLMs themselves are not upgrades to current Social Engineering (e.g. target identification, spearphishing, and bait-and-switch phishing) capabilities because of non-factuality, it could help draft realistic Social Engineering contents when provided with knowledge about targets (e.g. email phishing workflow).
  5. Although early versions of GPT4 are still ineffective at autonomously replicating, acquiring resources, and avoiding being shut down “in the wild”, LLMs actually show some emergent capability of power-seeking, creating and acting on long-terms plans, and agentic behavior (e.g. accomplish goals which may not have been concretely specified and which have not appeared in training; focus on achieving specific, quantifiable objectives; and do long-term planning). For example, GPT4 could pretend to be a human to get a TaskRabbit worker to solve CAPTCHA. This could be partially explained from the perspective of “emergent optimization” [Steinhardt et al., 2023]. First, RLHF could help LLMs choose from a diverse space of actions based on their long-term consequences. Besides, during pretraining, next-token predictors can even learn to plan [Steinhardt et al., 2023]. To find more evidence of this point, one can refer to the LangChain project.
  6. LLMs can bring more risks in real-world contexts through interactions with other systems or humans, because they have already appeared to learn and use representations of the outside world [Bowman et al., 2023]. For example, with literature and embedding tools, GPT4 could search for chemical compounds that are similar to other chemical compounds, propose alternatives that are purchasable in a commercial catalog, and execute the purchase. One can look for more examples from the open-sourced AutoGPT project (e.g. automatically growing twitter followers). Interactions with humans could bring even more risky societal effects. Algorithmic collusion could be one consequence after high-impact decision-makers rely on decision assistance from AI models whose outputs are correlated or interact in complex ways. AI could also possibly achieve manipulation of humans in the loop (e.g. polarization of users of recommender systems). The potential “emergent deception” can be another example. In such cases, an LLM could fool or manipulate the supervisor rather than doing the desired task (e.g. of providing true and relevant answers), because doing so gets better (or equal) reward [Steinhardt et al., 2023].
  7. Better AI could bring imbalanced economic impacts. For example, differential access and benefits from access to new tools and applications results in workforce displacement in some areas (e.g. legal services) and decline of wages for certain jobs given the competitive cost of the model. It could also bring changes in industrial organization and power structures due to collection of and access to training data. Specifically, existing social networks, technical infrastructure, and linguistic and cultural representation will play a role in who gets access and benefits from access, which could bring economic harms to certain groups. Also, data and models controlled by certain tech giants could be detrimental to other companies and the whole A(G)I community. Besides, infrequently updated models could entrench existing players and firms. For example, the models would always generate the same answer with temperature 0 for the following question: “What is the best bagel place in New York?”EconomicHowever, with quick learning and proper usage of advanced AI systems, it potentially enables more personalized and efficient services and creates new opportunities for job seekers.
  8. Overreliance on LLMs, although having not been widely observed, has become a trend in some areas. Such overreliance could come from people’s general trust in the model, as well as GPT4’s authoritative tone and generated context of highly detailed information. Interestingly, models’ hedging behaviors could also be an epistemic humility that fosters users’ overreliance. There are several possible negative consequences of overreliance, as listed in GPT4 paper: “users may not be vigilant for errors due to trust in the model; they may fail to provide appropriate oversight based on the use case and context; or they may utilize the model in domains where they lack expertise, making it difficult to identify mistakes. As users become more comfortable with the system, dependency on the model may hinder the development of new skills or even lead to the loss of important skills.”

How to Pursue AI Safety

To alleviate safety issues, one should first be aware of all the possible safety issues. Only then, it’s possible to alleviate them respectively.

However, finding new safety issues is not that easy sometimes. That’s why we focused much on various safety issues in the previous section. Also, it definitely requires real-world users and domain experts to red team an LLM, identifying more potential safety issues and creating more prompts that could elicit safety issues.

Then, several techniques should be applied to alleviate the safety issues whenever an LLM is going to be deployed in production. I summarized what has been mentioned in GPT4 technical reports into the following points.

  1. From the perspective of model mitigation, there are three general ways.
    1. During the LLM pretraining stage, those pretraining data which could potentially trigger safety issues should be filtered out.
    2. During the finetuning stage, one could use SFT or RLHF for safety alignment. The model could be aligned on diverse user instructions to be harmless.
    3. Alternatively, Learning from AI Feedback (LAIF) has also been demonstrated as useful. GPT4 used rule-based reward models (RBRM) for this purpose, while earlier work “Constitutional AI” [Bai et al., 2022] was among the very first to leverage AI criticism and revision, as well as RLAIF. This has been found helpful to learn a harmless but non-evasive (still helpful) assistant, considering that there has typically been a tradeoff between harmlessness and helpfulness alignment. (Say, if an AI assistant would always give “I could not answer it” as the response to any questions, it is totally harmless but not helpful at all. As an example, Anthropic’s safety-first language model could be too safe to be useful sometimes.) Besides, RLAIF makes it possible that human performance on a task isn’t an upper bound on LLM performance [Bowman et al., 2023].
  2. For any version of LLMs, we definitely need one way to evaluate their safety. One reliable and must-to-do way is domain expert red teaming, which could be costful. We also need some evaluation set and automatic evaluation models to evaluate frequently and choose among various versions of models. Sometimes, the evaluation model could be the underlying generative model itself (e.g. GPT4).
  3. When putting any models to production, we need to carefully design thorough user policies to avoid any use cases that could potentially cause safety issues. Meanwhile, online monitoring of the model usage is required. Some monitoring models could help classify and refuse to respond to unsafe prompts. Human monitoring is even needed sometimes.
  4. Ideally, a moderation classifier or API could also be provided to users for flexible usage, instead of just using it for internal monitoring.

In summary, any companies or individuals should pay more attention to AI safety when developing any LLMs or even A(G)Is. More safety issues should be identified with the help of more users, and should be avoided through several action items. Even so, there are more intended ways of triggering AI safety issues of LLMs (e.g. jailbreaking prompting). Thus, without enough efforts in the AI safety direction, huge progress in AI may cause severe consequences to humans and society.


  1. Tian, Ran, et al. “Sticking to the facts: Confident decoding for faithful data-to-text generation.” arXiv preprint arXiv:1910.08684 (2019).
  2. ElSherief, Mai, et al. “Latent hatred: A benchmark for understanding implicit hate speech.” arXiv preprint arXiv:2109.05322 (2021).
  3. Pryzant, Reid, et al. “Automatically neutralizing subjective bias in text.” Proceedings of the aaai conference on artificial intelligence. Vol. 34. No. 01. 2020.
  4. Zhong, Yang, et al. WIKIBIAS: Detecting Multi-Span Subjective Biases in Language[D]. EMNLP 2021
  5. Hu J, Ruder S, Siddhant A, et al. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation[C]. ICML 2020.
  6. Ziems C, Held W, Yang J, et al. Multi-VALUE: A Framework for Cross-Dialectal English NLP[J]. arXiv preprint arXiv:2212.08011, 2022.
  7. Yang D, Chen J, Yang Z, et al. Let’s make your request more persuasive: Modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms[C]. NAACL 2019.
  8. Ziems C, Li M, Zhang A, et al. Inducing Positive Perspectives with Text Reframing[J]. ACL 2022. ​9. ​Carlini N, Tramer F, Wallace E, et al. Extracting Training Data from Large Language Models[C]. USENIX Security Symposium. 2021, 6.
  9. Huang J, Shao H, Chang K C C. Are Large Pre-Trained Language Models Leaking Your Personal Information?[J]. arXiv preprint arXiv:2205.12628, 2022.
  10. Peng X, Zhang Y, Yang J, et al. On the Security Vulnerabilities of Text-to-SQL Models[J]. arXiv preprint arXiv:2211.15363, 2022.
  11. Eloundou T, Manning S, Mishkin P, et al. Gpts are gpts: An early look at the labor market impact potential of large language models[J]. arXiv preprint arXiv:2303.10130, 2023.
  12. Bai Y, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI Feedback[J]. arXiv preprint arXiv:2212.08073, 2022.
  13. Bowman S R. Eight things to know about large language models[J]. arXiv preprint arXiv:2304.00612, 2023.
  14. Emergent Deception and Emergent Optimization. Jacob Steinhardt 2023. https://bounded-regret.ghost.io/emergent-deception-optimization/
  15. AutoGPT: https://github.com/Significant-Gravitas/Auto-GPT
  16. LangChain: https://github.com/hwchase17/langchain