AI Safety, Governance, and Alignment Tutorial
Saturday, May 20, 1:15 – 2:45 pm

Facilitators: Dr. Sherri Lynn Conklin* & Gaurav Sett* (*Georgia Institute of Technology)

Topic Overview: The field of AI safety, governance, and alignment (SGA) is concerned with questions about how to integrate AI with human values. Topics in this area deal with (1) the transformative capabilities of AI within the global innovation helix; (2) the difficulty of controlling AI, especially with regard to designing and setting goals and to interpreting and explaining AI behavior; and (3) the difficulty of governing AI, especially with regard to establishing policies that ensure we develop beneficial AI and implementing practices across the broad domains of the global innovation helix to prevent catastrophic outcomes for humanity. With this in mind, one of the most effective tools we have for reducing risk to humanity is educating technologists so that they can implement mitigation strategies during the design and testing phases of AI, prior to release.

Tutorial Overview: We propose a 90-minute, discussion-focused workshop-tutorial on SGA with hands-on components, which falls into the general category of Innovation and Ethics Education. This tutorial has three parts.

Part 1: The Transformative Capabilities of AI within the Global Innovation Helix [25 min.]
Objectives: Attendees will be able to frame the scope of the SGA problem, including the rapid development of AI, the transformative capabilities of AI, and the ethical issues that AI will likely create for humanity. They will also be able to articulate the ethical obligations concerning SGA that should guide individuals within and across different sectors of the global innovation helix.

Summary: AI progress will be rapid. Talking points will cover, for example, how AI/ML researchers, in aggregate, assign a 50% probability to human-level machine intelligence arriving by 2060.[1] Increased compute has led to increased performance, and, while compute cost has decreased exponentially, investment in AI has drastically increased. Furthermore, the biological anchors model estimates a 50% probability that we will have human-brain-scale computation power by 2060.[2] As humanity develops increasingly sophisticated AI, SGA will emerge as one of the most important socio-ethical concerns of the next century, one that will most likely remain central to global innovation helix research and policy agendas as AI systems change, evolve, and take on increased responsibilities across public and private social domains. Despite its significance to the future of socially responsible innovation, SGA is an often-neglected field, regularly thought to serve as a barrier to technological innovation.

Part 2: The Difficulty of Controlling AI [40 min.]
Objectives: Small groups will examine curious examples of documented AI misalignment.[3] This presents attendees with hands-on opportunities to study and describe misalignment issues and to consider ethical concerns that arise from the behaviors of misaligned AI. Attendees will develop strategies for preventing misalignment from the standpoint of product and design requirements.

Summary: AI will be difficult to control, and the consequences of uncontrollable AI could be catastrophic. After all, intelligent agency is the most powerful force on this planet. Because of our supreme intellect, humans have transformed the environment, bred and slaughtered billions of animals, and developed smarter systems (AI). We should not release such a force absent strong control. However, it is hard to design AI systems that integrate human goals and values.
Our talking points include, for example, how an agent’s goals are affected by its intelligent abilities. Humans illustrate that general intelligence does not guarantee common purpose.[4] At the same time, humans illustrate that general intelligence tends to produce common instrumental goals such as self-preservation, incorrigibility, resource acquisition, and self-enhancement.[5] Because of persistent problems with AI misalignment, advanced AI will likely develop many goals and behaviors that humans did not intend. Inner alignment problems could arise when humans poorly specify AI goals. In some cases, AI systems develop gaming behavior comparable to that of children who take instructions for performing tasks literally.[6] They fail to share human objectives and instead follow only the instructions specified for achieving the objective (e.g., an AI trained to follow dots in order to complete a maze might develop the objective of dot following and not the objective of completing the maze).[7] Even if we specify the right objective, an AI may not understand the objective or may deceive us into believing it has internalized it.[8] Unfortunately, AI behavior is hard to explain, and AI systems are often considered black boxes.[9] Moreover, AI capabilities can emerge rapidly as model architectures develop, and these capabilities are hard to predict.[10] AI development is outpacing most approaches to explainable AI.
Given the rapid pace at which AI is developing and the potential impact of uncontrolled AI on humanity, the community of technology and ethics practitioners and theoreticians across the broad domains of the global innovation helix has an imperative to develop strategies for controlling AI.

Part 3: The Difficulty of Governing AI [25 min.]
Objectives: Attendees will be able to frame the scope of AI governance issues, which include (but are not limited to) weaponization, enfeeblement, propaganda and eroded epistemics, proxy gaming, and value lock-in. Attendees will propose other areas of governance concern and articulate the ways in which such considerations bear on their work. Attendees will also propose policies and other solutions that might support effective AI governance.

Summary: Even if we have the ability to control AI, we must ensure it is used beneficially, but governing AI will be difficult.[11] Talking points include problems such as: Weaponization, where governments are strongly incentivized to weaponize AI, which would significantly increase the risks of conflict;[12] Enfeeblement, where important decisions may be handed off to AI, endangering humanity’s capacity for self-governance (this scenario was depicted in the film WALL-E);[13] Eroded epistemics, where nations, political parties, and many other actors are strongly incentivized to develop agents that spread propaganda, undermining our ability to seek truth;[14] Proxy gaming, where AI may strongly shape human behavior in suboptimal ways, as illustrated by addiction caused by social media recommendation algorithms;[15] and Value lock-in, where advanced AI locks in the dominance of the nations or companies that develop it, curtailing the capacity for social progress.[16] We will discuss which practices should be implemented to ensure the development of beneficial AI.

[1] B. Zhang, N. Dreksler, M. Anderljung, L. Kahn, C. Giattino, A. Dafoe, & M. C. Horowitz, “Forecasting AI progress: Evidence from a survey of machine learning researchers,” 2022, arXiv preprint arXiv:2206.04132.
[2] A. Cotra, “Forecasting TAI with biological anchors,” unpublished; A. Cotra, “Draft report on AI timelines,” September 2020. (accessed Nov. 18, 2022).
[3] J. Shane, “AI Weirdness,” 2022. (accessed Nov. 18, 2022).
[4] N. Bostrom, “The superintelligent will: Motivation and instrumental rationality in advanced artificial agents,” Minds and Machines, vol. 22, no. 2, May 2012.
[5] ibid.
[6] V. Gabulaitė, “103 kids who take instructions too literally,” 2017. (accessed Nov. 18, 2022).
[7] V. Krakovna, J. Uesato, V. Mikulik, M. Rahtz, T. Everitt, R. Kumar, Z. Kenton, J. Leike, & S. Legg, “Specification gaming: the flip side of AI ingenuity,” April 2022. (accessed Nov. 18, 2022).
[8] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, & S. Garrabrant, “Risks from learned optimization in advanced machine learning systems,” 2021, arXiv preprint arXiv:1906.01820v3.
[9] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, & F. Giannotti, “A survey of methods for explaining black box models,” ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1-42, 2018.
[10] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama et al., “Emergent abilities of large language models,” 2022, arXiv preprint arXiv:2206.07682.
[11] CAIS, “What is AI risk?” 2022. (accessed Nov. 18, 2022).
[12] DARPA Public Affairs, “AlphaDogfight trials foreshadow future of human-machine symbiosis,” August 2022. (accessed Nov. 18, 2022); B. Buchanan, J. Bansemer, D. Cary, J. Lucas, & M. Musser, “Automating cyber attacks,” Center for Security and Emerging Technology, November 2020.
[13] D. Hendrycks & M. Mazeika, “X-risk analysis for AI research,” 2022, arXiv preprint arXiv:2206.05862.
[14] ibid.
[15] R. Jiang, S. Chiappa, T. Lattimore, A. György, & P. Kohli, “Degenerate feedback loops in recommender systems,” Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 383-390, January 2019.
[16] D. Hendrycks, op. cit.