Alignment with Human Values: How Do We Make AI "Align with Human Values"? Are the Giants' Explorations for Products or for Humanity?

Written by: Jessica Dai, a Ph.D. student in Computer Science at the University of California, Berkeley

Source: Reboot


How should we make AI "align with human values"?

Exaggerated reporting on the "existential risk of artificial intelligence" (abbreviated "x-risk") has gone mainstream. Who could have predicted that the onomatopoeia "Fᴏᴏᴍ" - evocative of, and directly derived from, children's cartoons - would appear uncritically in The New Yorker? More than at any other time, public discussion of artificial intelligence and its risks, and of how to address those risks, seems unusually muddled, blurring speculative future risks with present-day harms and, on the technical side, conflating large models that "approximate intelligence" with algorithmic and statistical decision systems.

So what, then, are the stakes of AI progress? For all the debate about catastrophic harms and extinction-level events, the current trajectory of "alignment" research seems poorly suited - misaligned, even - to the claim that artificial intelligence may cause widespread, specific, and severe suffering. In my view, rather than tackling the grand challenge of human extinction, we are tackling a perennial (and notoriously important) problem: creating products that people are willing to pay for. Ironically, it is precisely this prioritization that has created the conditions for both the real and the imagined doomsday scenarios.

Tool, Toy, or Just a Product?

What I want to say is that OpenAI's ChatGPT, Anthropic's Claude, and all the other latest models can do what they do, and that is very, very cool. While I would not claim that these models possess intelligence poised to replace human workers, or that I would rely on them for important tasks, it would be disingenuous to deny that they are useful and powerful.

What the "AI safety" community is concerned about is precisely these capabilities. Their idea is that artificial intelligence systems will inevitably surpass human reasoning abilities, transcend "artificial general intelligence" (AGI), and become "superintelligence"; their actions will surpass our understanding; their existence will undermine our values in the pursuit of their goals. This transformation may be rapid and sudden ("ꜰᴏᴏᴍ"), as claimed by a small but vocal group of AI practitioners and scholars. A broader alliance in the "effective altruism" (EA) ideological movement sees the coordination of artificial intelligence work as a crucial intervention to prevent artificial intelligence-related disasters.

In fact, "technical research and engineering" in AI alignment is the single most highly recommended path put forward by 80,000 Hours (an influential EA organization focused on career guidance). In a recent interview with The New York Times, Nick Bostrom, the author of "Superintelligence" and a core intellectual architect of effective altruism, defined "alignment" as "ensuring that the increasingly capable artificial intelligence systems we build are aligned with the goals of the people who build them."

So who are "we"? And what do "we" want to achieve? Right now, "we" are private companies, most prominently OpenAI, one of the pioneers of the AGI project, and Anthropic, founded by a group of OpenAI alumni. Building superintelligence is, in effect, one of OpenAI's primary goals. But given the magnitude of the risks, why pursue it? In their own words:

First, we believe it will lead to a much better world than we can imagine today (we are already seeing early examples of this in education, creative work, and personal productivity)… The economic growth and improvement in quality of life will be staggering.
Second, we believe it would be unimaginably risky and difficult to stop the creation of superintelligence. Because the benefits are so enormous, the cost of building it decreases year by year, the number of actors building it is rapidly increasing, and it is inherently part of the technological path we are on… We have to get it right.

In other words, first, because it can make us a lot of money; and second, because it will make someone a lot of money, so it had better be us. (OpenAI, of course, bears the burden of substantiating these claims: that artificial intelligence can bring about an "unimaginably" better world; that it has "already" benefited education, creative work, and personal productivity; and that the existence of such a tool can substantially improve quality of life for more than just those who profit from its existence.)

Of course, this reading is cynical, and I do not believe that most people at OpenAI joined for personal financial gain. On the contrary, I believe their interest is sincere, spanning the technical work of realizing large models, the interdisciplinary dialogue analyzing their social impact, and the hope of taking part in building the future. But an organization's goals are ultimately distinct from the goals of the individuals who make it up. No matter what is said publicly, revenue generation is always at least a complementary goal, and OpenAI's management, product, and technology decisions will be shaped by it, even if never fully acknowledged. Interviews with CEO Sam Altman suggest that commercialization is a top priority for Altman and the company. OpenAI's "customer stories" page looks like any other startup's: flashy screenshots and quotes, name-drops of well-known companies, and the obligatory "tech for good" highlights.

Anthropic, the company famously founded by former OpenAI employees over concerns about OpenAI's turn toward profit, is the other major player here. Their argument for existing at all - if these models are truly dangerous, why build even more powerful ones? - is more measured, resting primarily on the research-driven claim that studying models at the frontier of capability is necessary to truly understand their risks. Yet like OpenAI, Anthropic has its own glossy "product" page, its own quotes, feature descriptions, and use cases. And Anthropic has raised hundreds of millions of dollars round after round.

OpenAI and Anthropic may be doing research, pushing technology forward, and perhaps even building superintelligence, but it cannot be denied that they are also building products - products that carry accountability, that need to sell, and that need to be designed to win and keep market share. However technically impressive, useful, and interesting Claude and GPT-x may be, they are ultimately tools (products) that users (customers) want to use to accomplish specific, probably mundane tasks.

There is nothing inherently wrong with making products, and of course companies will try to make money. But what might be called the "financial side quest" inevitably complicates our understanding of how to build aligned AI systems, and it raises the question of whether alignment methods are really suited to averting catastrophe.

Computer Scientists Love Models

In the same New York Times interview about the possibility of superintelligence, Bostrom - a philosopher by training - said of the alignment problem: "It's a technical problem."

I am not saying that people without a computer science background are unqualified to comment on these issues. On the contrary, I find it ironic that the hard work of devising solutions gets deferred to people outside his own field, much as computer scientists tend to treat "ethics" as lying far beyond their professional remit. But if Bostrom is right that alignment is a technical problem, what exactly is the technical challenge?

First of all, the ideologies of artificial intelligence and alignment are diverse. Many of those worried about existential risk are sharply critical of the approaches OpenAI and Anthropic have taken, and indeed have raised similar concerns about the two companies' product orientation. Still, it is both necessary and sufficient to focus on what these companies are doing: they currently hold the most powerful models, and unlike other large-model providers such as Mosaic or Hugging Face, they put the most emphasis on alignment and "superintelligence" in their public communications.

An important part of this landscape is a deep, close-knit community of individual researchers motivated by x-risk, who have developed an extensive vocabulary around AI safety and alignment theory, much of it first introduced in detailed blog posts on forums such as LessWrong and the AI Alignment Forum.

One such concept is intent alignment, which is useful for contextualizing technical alignment work and is perhaps a more formal version of what Bostrom was referring to. In a 2018 Medium post introducing the term, Paul Christiano, who once led the alignment team at OpenAI, defined intent alignment as an AI (A) "trying to do what the human (H) wants it to do." Defined this way, the "alignment problem" suddenly becomes more manageable - if not fully solvable, then at least partially addressable through technical means.

Here, I will focus on the research direction concerned with shaping the behavior of AI systems to "align" with human values. The main goal of this line of work is to develop models of human preferences and use them to improve a "misaligned" base model. This has been a topic of intense interest in both industry and academia; the most prominent examples are reinforcement learning from human feedback (RLHF) and its successor, reinforcement learning from AI feedback (RLAIF, also known as Constitutional AI), the techniques used to fine-tune OpenAI's ChatGPT and Anthropic's Claude, respectively.

In these methods, the core idea is to start from a powerful, "pre-trained," but unaligned base model, for example, a model that can successfully answer questions but may also use profanity while doing so. The next step is to create "human preference" models. Ideally, we could ask all 8 billion people on Earth about their feelings towards all possible outputs of the base model; but in practice, we train an additional machine learning model to predict human preferences. This "preference model" is then used to critique and improve the outputs of the base model.
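To make the "critique and improve" step concrete, here is a minimal sketch in Python (PyTorch/Transformers) of one simple way a learned preference model can steer a base model: sample several candidate responses and keep the one the preference model scores highest (best-of-n sampling). This is an illustration, not OpenAI's or Anthropic's actual pipeline; the gpt2 checkpoint and the `preference_score` function are stand-ins I have made up for the sketch.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small public checkpoint as a stand-in for a powerful but unaligned base model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

def preference_score(prompt: str, response: str) -> float:
    """Placeholder for a trained preference (reward) model.

    In a real RLHF pipeline this would be a separate network trained on
    pairwise human (or AI) comparisons; here it is only a toy heuristic.
    """
    return -abs(len(response) - 200)  # pretend mid-length answers are "better"

def best_of_n(prompt: str, n: int = 4, max_new_tokens: int = 64) -> str:
    """Sample n candidate responses and keep the one the preference model likes best."""
    inputs = tokenizer(prompt, return_tensors="pt")
    candidates = []
    for _ in range(n):
        output_ids = base_model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        candidates.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    # The preference model "critiques" the base model by ranking its outputs.
    return max(candidates, key=lambda response: preference_score(prompt, response))

if __name__ == "__main__":
    print(best_of_n("Explain why the sky is blue."))
```

In production systems the preference model is typically also used as a reward signal to fine-tune the base model itself (for example with PPO), rather than only to rerank its outputs.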

For both OpenAI and Anthropic, the "preference model" is aligned with the overarching values of being "helpful, harmless, and honest" (HHH). In other words, the preference model captures the kinds of chatbot outputs that humans tend to judge as "HHH." The preference model itself is built through an iterative process of pairwise comparisons: after the base model generates two responses, a human (in ChatGPT's case) or an AI (in Claude's case) judges which response is "more HHH," and that comparison is fed back to update the preference model. Recent research suggests that enough of these pairwise comparisons eventually converge to a good universal preference model - assuming, that is, that there really is a single universal model of what is normatively better.
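Below is a minimal sketch, again in Python (PyTorch), of the pairwise-comparison training loop just described, under stated assumptions: a small preference model assigns a scalar score to an encoded response and is nudged so that the response judged "more HHH" (by a human annotator in RLHF, or an AI judge in RLAIF) scores higher than the rejected one, via a Bradley-Terry-style logistic loss. The feature encoding and the judge function are toy placeholders, not any company's real setup.

```python
import torch
import torch.nn as nn

class PreferenceModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a scalar preference score."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def judge_more_hhh(features_a: torch.Tensor, features_b: torch.Tensor) -> int:
    """Stand-in for the comparison label: 0 if response A is preferred, 1 otherwise.

    In RLHF this label comes from a human; in RLAIF / Constitutional AI it comes
    from another model prompted with a set of principles.
    """
    return int((features_b.sum() > features_a.sum()).item())

model = PreferenceModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Pretend these are encodings of two candidate responses to the same prompt.
    a, b = torch.randn(16), torch.randn(16)
    preferred, rejected = (a, b) if judge_more_hhh(a, b) == 0 else (b, a)
    # Bradley-Terry-style loss: push the preferred response's score above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```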

All of these technical approaches - and the broader "intent alignment" framework - are deceptively convenient. Some limitations are obvious: bad actors may have "bad intent," in which case intent alignment is itself a problem; moreover, "intent alignment" assumes that the intent in question is known, explicit, and uncontroversial - an unsurprisingly hard assumption in a society with diverse and often conflicting values.

And the "financial tasks" side-steps both of these issues, which is exactly what I am truly concerned about here: the existence of financial incentives means that alignment work often turns into disguised product development rather than making real progress in mitigating long-term harms. The RLHF/RLAIF methods - currently the most advanced methods for adjusting models based on "human values" - are almost entirely tailored for creating better products. After all, the focus groups for product design and marketing are the original "human feedback reinforcement learning."

The first and most obvious problem is determining the values themselves. In other words, which values? Whose values? Why "HHH," for instance, and why implement "HHH" in this particular way? It is far easier to specify values that guide the development of a broadly useful product than values that might meaningfully prevent catastrophic harm; and it is far easier to average over a hazy interpretation of what these values mean to people than to meaningfully engage with their disagreements. Perhaps, in the absence of anything better, "helpful, harmless, and honest" is at least a reasonable demand of a chatbot product. Anthropic's marketing page is filled with quotes and phrases about its alignment work - "HHH" is also Claude's biggest selling point.

To be fair, Anthropic has published Claude's principles, and OpenAI appears to be seeking ways to involve the public in decision-making. But while OpenAI publicly "advocates" for more government involvement, it also lobbies for less regulation; and, conversely, extensive involvement by incumbents in designing legislation is an obvious path to regulatory capture. OpenAI, Anthropic, and similar startups exist to dominate the future market for extremely powerful models.

These economic incentives bear directly on product decisions. As we have seen on online platforms, where content moderation policies are inevitably shaped by ad revenue and so default to the bare minimum, the intended universality of these large models means they too face overwhelming incentives to minimize constraints on model behavior. Indeed, OpenAI has explicitly said that it plans for ChatGPT to reflect a minimal set of baseline behavioral guidelines that downstream users can further customize. From an alignment perspective, the hope is that OpenAI's foundational guardrail layer will be robust enough that customized, downstream "intent alignment" - whatever those intents may be - remains harmless.

The second issue is that techniques relying on a simplified "feedback model" of human preferences currently address only a surface-level, user-interface problem at the chatbot layer, rather than shaping the models' fundamental capabilities - the original source of concern about risk. For example, while ChatGPT is instructed not to use racial slurs, that does not mean it does not internally represent harmful stereotypes. (I asked ChatGPT and Claude to describe an Asian female student whose name starts with M; ChatGPT gave me "Mei Ling" and Claude gave me "Mei Chen," and both described "Mei" as shy, studious, and hardworking, yet resentful of her parents' high expectations.) Even the principles Claude follows during training are superficial rather than substantive: "Which of these responses indicates that the AI's goals are aligned with humanity's well-being rather than its own short-term or long-term interests?… Which responses from the AI assistant imply that the AI system only has desires for the good of humanity?"

I am not calling for OpenAI or Anthropic to stop what they are doing; nor am I saying that these companies or academics should not do alignment research, or that the research questions are easy or not worth pursuing. I am not even saying that these alignment methods will never help address specific harms. It simply seems a little too convenient, in my view, that the major directions of alignment research happen to be exactly what is needed to build better products.

Aligning chatbots is hard, both technically and normatively. So is providing a foundational platform for customized models and deciding where and how to draw the boundaries of customization. But these tasks are fundamentally product-driven; they are a different problem from preventing extinction, and I struggle to reconcile the two: on the one hand, our task is to build a product people will buy (under the market's short-term incentives); on the other, our task is to prevent long-term harm. OpenAI and Anthropic may of course manage to do both, but if we entertain the worst case, the likelihood that they cannot, given their organizational incentives, seems high.

How Do We Address Extinction Problems?

The state of public discourse matters for discussions of artificial intelligence and its potential harms and benefits; so do public opinion, awareness, and understanding. This is why Sam Altman's international policy and media tour matters, and why the EA movement places so much weight on evangelism and public debate. For stakes as high as a (potential) existential catastrophe, we need to get this right.

But the existential-risk argument is itself a discourse that generates a self-fulfilling prophecy. News coverage of, and attention to, the dangers of superintelligence naturally also draws attention to the ambition of artificial intelligence as something capable enough to handle significant decisions. A critical reading of Altman's policy tour, then, is that it is a Machiavellian advertisement for the use of artificial intelligence, one that benefits not only OpenAI but also the other companies selling "superintelligence," such as Anthropic.

The crux is this: the path to AI x-risk ultimately requires a society in which relying on, and trusting, algorithms to make consequential decisions is not only commonplace but encouraged and incentivized. It is in such a world that breathless speculation about the capabilities of artificial intelligence becomes reality.

Consider the mechanisms by which those worried about long-term harms claim catastrophe could occur: power-seeking, where an AI agent constantly demands more resources; reward hacking, where an AI finds a way of behaving that appears to match human goals but is achieved through harmful shortcuts; and deception, where an AI, in pursuit of its own goals, placates humans and convinces them that its behavior is actually as designed.

Emphasizing AI capabilities - saying "if AI becomes too powerful, it could kill us all" - is a rhetorical move that ignores all the other "if" conditions embedded in that sentence: if we decide to outsource reasoning about consequential decisions - about policy, business strategy, or our personal lives - to algorithms; if we decide to give AI systems direct access to resources (power grids, utilities, compute) and the authority to influence how those resources are allocated. Every AI x-risk scenario involves a world in which we have decided to offload responsibility onto algorithms.

Emphasizing the severity, even the omnipotence, of the problem is a useful rhetorical strategy, because no solution will ever fully solve the original problem, and criticism of attempted solutions is easily deflected with "something is better than nothing." If extremely powerful AI systems really can cause catastrophic destruction, then we should applaud any alignment research effort today, even if the work itself is somewhat misdirected, and even if it falls short of what we might hope for. If alignment really is exceptionally hard, then we should leave it to the experts and trust that they act in everyone's best interest. And if AI systems really are powerful enough to cause such grievous harm, then they must also be capable enough to replace, augment, or otherwise substantially shape current human decision-making.

There is a rich, nuanced conversation to be had about when and whether algorithms can be used to improve human decision-making, how to measure their influence on human decisions or evaluate the quality of their recommendations, and what it even means to improve human decision-making in the first place. A large community of activists, scholars, and community organizers has been pushing this conversation for years. Preventing extinction, or mass harm, requires engaging seriously with that conversation, and recognizing that what might be dismissed as "local" "case studies" not only have enormous impact on the people involved - sometimes on their very survival - but are also illuminating and generative for building frameworks that integrate algorithms into real-world decision-making. In criminal justice, for example, algorithms may succeed in reducing the overall prison population yet fail to address racial disparities. In healthcare, algorithms could in theory improve clinical decision-making, but in practice the organizational structures that shape AI deployment are highly complex.

There are certainly technical challenges here, but focusing on technical decisions obscures these higher-level questions. In academia, not only economics, social choice, and political science, but also a wide range of disciplines such as history, sociology, gender studies, race studies, and Black studies offer frameworks for reasoning about what counts as effective governance, what devolving decision-making power in service of collective interests looks like, what genuine participation in the public sphere means, and when those in power treat only certain kinds of contributions as legitimate. From individual behavior to macro-level policy, civil society organizations and activist groups have decades, even centuries, of collective experience grappling with how to achieve substantive change at every level.

The stakes of AI progress, then, are not only about technical capabilities and whether they will cross some arbitrarily imagined threshold. They are also about how we - the general public - talk, write, and think about AI, and about how we choose to allocate our time, attention, and capital. The latest models are genuinely impressive, and alignment research explores genuinely fascinating technical questions. But if we are truly worried about the catastrophes AI could trigger, existential or otherwise, we cannot rely on those who stand to gain the most from a future in which AI is widely deployed.

