Reflections on our Responsible Scaling Policy
Last summer we published our first Responsible Scaling Policy (RSP), which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings. This post shares reflections from implementing the policy so far. We are also working on an updated RSP and will share this soon.
We have found having a clearly articulated policy on catastrophic risks extremely valuable. It has provided a structured framework to clarify our organizational priorities and frame discussions around project timelines, headcount, threat models, and tradeoffs. The process of implementing the policy has also surfaced a range of important questions, projects, and dependencies that might otherwise have taken longer to identify or gone undiscussed.
Balancing the desire for strong commitments with the reality that we are still seeking the right answers is challenging. In some cases, the original policy is ambiguous and needs clarification. In cases where there are open research questions or uncertainties, setting overly specific requirements is unlikely to stand the test of time. That said, as industry actors face increasing commercial pressures, we hope to move from voluntary commitments to established best practices and then to well-crafted regulations.
As we continue to iterate on and improve the original policy, we are actively exploring ways to incorporate practices from existing risk management and operational safety domains. While none of these domains alone will be perfectly analogous, we expect to find valuable insights from nuclear security, biosecurity, systems safety, autonomous vehicles, aerospace, and cybersecurity. We are building an interdisciplinary team to help us integrate the most relevant and valuable practices from each.
Our current framework for doing so is summarized below, as a set of five high-level commitments.
- Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).
- Testing for Red Line Capabilities (Frontier Risk Evaluations). We commit to demonstrating that the Red Line Capabilities are not present in models, or, if we cannot do so, taking action as if they are (more below). This involves collaborating with domain experts to design a range of "Frontier Risk Evaluations": empirical tests which, if failed, would give strong evidence against a model being at or near a red line capability. We also commit to maintaining a clear evaluation process and a public summary of our current evaluations.
- Responding to Red Line Capabilities. We commit to developing and implementing a new standard for safety and security sufficient to handle models that have the Red Line Capabilities. This set of measures is referred to as the ASL-3 Standard. We commit not only to defining the risk mitigations comprising this standard, but also to detailing and following an assurance process to validate the standard’s effectiveness. Finally, we commit to pausing training or deployment if necessary to ensure that models with Red Line Capabilities are only trained, stored, and deployed when we are able to apply the ASL-3 standard.
- Iteratively extending this policy. Before we proceed with activities which require the ASL-3 standard, we commit to publishing a clear description of its upper bound of suitability: a new set of Red Line Capabilities for which we must build Frontier Risk Evaluations, and which would require a higher standard of safety and security (ASL-4) before proceeding with training and deployment. This includes maintaining a clear evaluation process and a public summary of our evaluations.
- Assurance Mechanisms. We commit to ensuring this policy is executed as intended, by implementing Assurance Mechanisms. These should ensure that our evaluation process is stress-tested; our safety and security mitigations are validated publicly or by disinterested experts; our Board of Directors and Long-Term Benefit Trust have sufficient oversight over the policy implementation to identify any areas of non-compliance; and that the policy itself is updated via an appropriate process.
Threat Modeling and Evaluations
Our Frontier Red Team and Alignment Science teams have focused on threat modeling and engaging with domain experts. They are primarily focused on (a) improving threat models to determine which capabilities would warrant the ASL-3 standard of security and safety, (b) working with teams developing ASL-3 controls to ensure that those controls are tailored to the correct risks, and (c) mapping capabilities which the ASL-3 standard would be insufficient to handle, and which we would continue to test for even once it is implemented. Some key reflections are:
- Each new generation of models has emergent capabilities, which makes anticipating the properties of future models unusually challenging. There is a serious need for further threat modeling.
- There is reasonable disagreement amongst experts over which risks to prioritize and how new capabilities might cause harm, even in the relatively established Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Given the lack of a consensus view, talking to a wide variety of experts in different sub-domains has been valuable.
- Attempting to make threat models quantitative has been helpful for deciding which capabilities and scenarios to prioritize.
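To give a sense of what making threat models quantitative can look like in practice, here is a toy sketch that ranks hypothetical capability scenarios by a crude expected-harm score. The scenarios, probabilities, and harm estimates are illustrative placeholders, not our actual threat models or figures.

```python
# Toy illustration of quantitative threat-model prioritization. All scenarios,
# probabilities, and harm estimates are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    p_capability: float     # chance the capability emerges in the next model generation
    p_misuse: float         # chance of attempted misuse, given the capability exists
    harm_if_misused: float  # rough harm estimate if misuse succeeds (arbitrary units)


def priority(s: Scenario) -> float:
    """Crude expected-harm product used to rank scenarios."""
    return s.p_capability * s.p_misuse * s.harm_if_misused


scenarios = [
    Scenario("uplift_to_novice_attacker", p_capability=0.10, p_misuse=0.50, harm_if_misused=1_000),
    Scenario("fully_autonomous_replication", p_capability=0.02, p_misuse=0.20, harm_if_misused=10_000),
    Scenario("expert_level_advice_only", p_capability=0.30, p_misuse=0.40, harm_if_misused=100),
]

for s in sorted(scenarios, key=priority, reverse=True):
    print(f"{s.name}: priority = {priority(s):.1f}")
```

Even a rough calculation like this forces explicit estimates that domain experts can then debate and refine.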
Our Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing teams are focused on building evaluations and improving our overall methodology. Currently, we conduct pre-deployment testing in the domains of cybersecurity, CBRN, and Model Autonomy for frontier models which have reached 4x the compute of our most recently tested model (you can read a more detailed description of our most recent set of evaluations on Claude 3 Opus here). We also test models mid-training if they reach this threshold, and re-test our most capable model every 3 months to account for finetuning improvements; a minimal sketch of this testing trigger appears after the list below. Teams are also focused on building evaluations in a number of new domains to monitor for capabilities for which the ASL-3 standard will still be unsuitable, and identifying ways to make the overall testing process more robust. Some key reflections are:
- Fast iteration cycles with domain experts are especially valuable for recognizing when the difficulty level of a test is poorly calibrated or the task is divorced from the threat model in question.
- We should increasingly aim to leverage and encourage the growing ecosystem of researchers and firms in this space. Many of the risks we aim to assess, particularly those involving autonomy or misalignment, are inherently complex and speculative, and our own testing and threat modeling are likely incomplete. It will also be valuable to develop a mature external ecosystem that can adequately assess the quality of our claims, as well as offer accessible evals as a service to less well-resourced companies. We have begun to test partnerships with external organizations in these areas.
- Different evaluation methodologies have their own strengths and weaknesses, and the methods that most compellingly assess a model's capabilities will differ depending on the threat model or domain in question.
- Question & answer datasets are relatively easy to design and quick to run. However, they may not be the most reflective of real-world risk due to their inherently constrained format. Teams will continue to explore designing datasets that serve as good proxies for more complex sets of tasks, and that could trigger a more comprehensive, time-intensive round of testing.
- Human trials comparing the performance of subjects with model access to that of subjects with search engines are valuable for measuring misuse-related domains. However, they are time-intensive, requiring robust, well-documented, and reproducible processes. We have found it especially important to focus on establishing good expert baselines, ensuring sufficient trial sizes, and performing careful statistical inference in order to get meaningful signals from trials; a toy sketch of this kind of comparison also appears after this list. We are exploring ways to scale up our infrastructure to run these types of tests.
- Automated task evaluations have proven informative for threat models where models take actions autonomously. However, building realistic virtual environments is one of the more engineering-intensive styles of evaluation. Such tasks also require secure infrastructure and safe handling of model interactions, including manual human review of tool use when the task involves the open internet, blocking potentially harmful outputs, and isolating vulnerable machines to reduce scope. These considerations make scaling the tasks challenging.
- Although less rigorous and reproducible than the approaches described above, expert red-teaming and reviewing model behavior via transcripts have also proven valuable. These methods allow for more open-ended exploration of model capabilities and make it easier to seek expert opinions on the relevance of different evaluation tasks or questions.
- There are a number of open research questions on which our teams will focus over the coming months to build a reliable evaluation process. We welcome more exploration in these areas from the broader research community.
- We aim to collect evidence about model risk and prepare suitable mitigations before reaching dangerous thresholds. This requires extrapolating from current evidence to future risk levels. Ideally, the “scaling laws” that lead to dangerous capabilities would be smooth, making it possible to predict when models might develop them. In the future, we hope to be able to predict precisely how much more capable a next-generation model will be in a given domain.
- Techniques such as domain-specific reinforcement learning training, prompt engineering, and supervised fine-tuning can be used to help models complete tasks more effectively. This makes it impossible to guarantee we are eliciting all the relevant model capabilities during testing. A good testing process therefore involves making a concerted effort to pass evaluations by investing in capability elicitation improvements. This is important to simulate scenarios where well-resourced malicious actors bypass security controls and gain access to model weights. However, there is no clear distinction between trying extremely hard to elicit a dangerous capability in a model and simply training a model to have that capability. We hope to make more precise and principled claims about what sufficient elicitation would look like in future versions of the policy.
- There is significant value in making our risk assessment process externally legible. We have therefore aimed to pre-specify test results we think are indicative of an intolerable level of risk when left unmitigated. These clear commitments help avoid production pressures incentivizing the relaxation of standards, although they may inevitably result in somewhat crude or arbitrary thresholds. We would like to explore ways to better aggregate the different sources of evidence described above while maintaining external legibility for verifiable commitments. Similarly, we may explore whether to incorporate other sources of evidence, such as forecasting, which are common in other domains.
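To make the testing cadence described above concrete, here is a minimal sketch of the trigger logic: evaluate a model once it reaches 4x the training compute of the most recently tested model, and re-test the most capable model every three months. The function, its inputs, and the compute accounting are illustrative assumptions, not our internal tooling.

```python
# Minimal sketch of the evaluation triggers described above. Names, signatures,
# and the compute accounting are illustrative assumptions, not internal tooling.
from datetime import datetime, timedelta
from typing import Optional

COMPUTE_MULTIPLIER = 4.0              # test once compute reaches 4x the last tested model
RETEST_INTERVAL = timedelta(days=90)  # re-test the most capable model every ~3 months


def needs_frontier_evaluation(
    candidate_compute: float,
    last_tested_compute: float,
    is_most_capable_model: bool,
    last_tested_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the model (mid-training or pre-deployment) should be evaluated."""
    now = now or datetime.utcnow()
    crossed_compute_threshold = candidate_compute >= COMPUTE_MULTIPLIER * last_tested_compute
    retest_due = is_most_capable_model and (now - last_tested_at) >= RETEST_INTERVAL
    return crossed_compute_threshold or retest_due
```

The same check applies mid-training, since a model can cross the compute threshold well before it is ever deployed.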
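The human uplift trials described above depend on careful statistical inference. As a toy illustration of the kind of comparison involved, the sketch below contrasts task scores for a group with model access against a control group using search engines; the data, group sizes, and choice of test are illustrative assumptions rather than our actual trial design.

```python
# Toy sketch of a human uplift comparison: model-assisted group vs. search-engine
# control group. Data and test choice are illustrative, not a real trial design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical per-participant task scores (0-100). Real trials require good
# expert baselines, adequate sample sizes, and pre-registered analysis plans.
model_group = rng.normal(loc=62, scale=12, size=40)
search_group = rng.normal(loc=55, scale=12, size=40)

# Welch's t-test avoids assuming equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(model_group, search_group, equal_var=False)
uplift = model_group.mean() - search_group.mean()

print(f"Estimated uplift: {uplift:.1f} points (t = {t_stat:.2f}, p = {p_value:.3f})")
```

With small samples and noisy baselines, point estimates like this can easily mislead, which is why trial size and careful inference matter so much in practice.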
The ASL-3 Standard
Our Security, Alignment Science, and Trust and Safety teams have been focused on developing the ASL-3 standard. Their goal is to design and implement a set of controls that will sufficiently mitigate the risk of the model weights being stolen by non-state actors or models being misused via our product surfaces. This standard would be sufficient for many models with capabilities where even a low rate of misuse could be catastrophic. However, it would not be sufficient to handle capabilities which would enable state groups or groups with substantial state backing and resources. Some key reflections are:
- Our current plans for ensuring models are used safely and responsibly in all of our product surfaces (e.g. Vertex, Bedrock, Claude.ai) involve scaling up research on classifier models for automated detection and response as well as strengthening all aspects of traditional trust and safety practices.
- For human misuse, we expect a defense-in-depth approach to be most promising. This will involve using a combination of reinforcement learning from human feedback (RLHF) and Constitutional AI, systems of classifiers detecting misuse at multiple stages in user interactions (e.g. user prompts, model completions, and at the conversation level), and incident response and patching for jailbreaks; a rough sketch of such a layered pipeline follows this list. Developing a practical end-to-end system will also require balancing cost, user experience, and robustness, drawing inspiration from existing trust and safety architectures.
- As described in the Responsible Scaling Policy, we will red-team this end-to-end system prior to deployment to ensure robustness against sophisticated attacks. We emphasize the importance of tying risk mitigation efforts directly to threat models, and have found that these risk mitigation objectives are improved via close collaboration between the teams developing our red-teaming approach and the researchers leading our threat modeling and evaluations efforts.
- Scaling up our security program and developing a comprehensive roadmap to defend against a wide variety of non-state actors has required a surge of effort: around 8% of all Anthropic employees are now working on security-adjacent areas and we expect that proportion to grow further as models become more economically valuable to attackers. The threat models and security targets articulated in the RSP have been especially valuable for our security team to help prioritize and motivate the necessary changes.
- Implementing the level of security required by the ASL-3 standard will require changing every aspect of employees' day-to-day workflows. To make these changes in a thoughtful way, our security team has invested significant time in building partnerships with teams, especially researchers, to preserve productivity while applying state-of-the-art cybersecurity controls to tooling.
- Our threat modeling assumes that insider device compromise is our highest-risk vector. Given this, one of our main areas of focus has been implementing multi-party authorization and time-bounded access controls to reduce the risk of model weight exfiltration; a toy sketch of such a grant also follows this list. Under this system, employees are granted only temporary access with the smallest set of necessary permissions. Fortunately, Anthropic has already adopted a culture of peer review across software engineering, research, comms, and finance teams, and so adopting multi-party controls as we approach the ASL-3 level has been a well-received extension of these existing cultural norms.
- In such a fast-moving field, it is often difficult to define risk mitigations, or even the methods we will use to assess their effectiveness, upfront. We want to make binding commitments where possible while still allowing degrees of freedom when new information and situations arise. We expect it will be most practical, for both the ASL-3 standard and future standards, to provide a high-level sketch of expected mitigations and set clear “attestation” standards they must meet before use. For example, with our security standard, we can clarify the goal of defending against non-state actors without specifying detailed controls in advance, and pair this with a sensible attestation process involving detailed control lists, review from disinterested experts, and board approval.
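As a rough illustration of the defense-in-depth approach to human misuse described above, the sketch below applies independent classifiers at the prompt, completion, and conversation level, any of which can flag an interaction for further action. The classifier interface, thresholds, and response policy are hypothetical simplifications, not our production system.

```python
# Hedged sketch of layered misuse detection: independent checks at the prompt,
# completion, and conversation level. Interfaces and thresholds are hypothetical.
from typing import Callable, List

Classifier = Callable[[str], float]  # returns a misuse score in [0, 1]


def layered_misuse_check(
    prompt: str,
    completion: str,
    conversation: str,
    prompt_clf: Classifier,
    completion_clf: Classifier,
    conversation_clf: Classifier,
    threshold: float = 0.9,
) -> List[str]:
    """Return the list of stages that flagged potential misuse."""
    flags = []
    if prompt_clf(prompt) >= threshold:
        flags.append("prompt")
    if completion_clf(completion) >= threshold:
        flags.append("completion")
    if conversation_clf(conversation) >= threshold:
        flags.append("conversation")
    # Downstream policy decides whether to block, warn, rate-limit, or escalate
    # to human review; jailbreak patching feeds back into the classifiers.
    return flags
```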
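The multi-party authorization and time-bounded access controls mentioned above can also be sketched in miniature. The two-approver rule, four-hour default window, and data structures below are illustrative assumptions, not our actual access-control implementation.

```python
# Toy sketch of a time-bounded, multi-party-approved access grant. The approver
# count, default duration, and interfaces are illustrative assumptions only.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AccessGrant:
    requester: str
    resource: str
    permissions: frozenset                 # smallest set of permissions needed for the task
    approvers: set = field(default_factory=set)
    granted_at: Optional[datetime] = None
    duration: timedelta = timedelta(hours=4)

    def approve(self, approver: str) -> None:
        if approver == self.requester:
            raise ValueError("Requesters cannot approve their own access.")
        self.approvers.add(approver)
        # Multi-party authorization: require at least two distinct approvers.
        if len(self.approvers) >= 2 and self.granted_at is None:
            self.granted_at = datetime.utcnow()

    def is_active(self) -> bool:
        """Access is valid only after approval and before the time bound expires."""
        return (
            self.granted_at is not None
            and datetime.utcnow() < self.granted_at + self.duration
        )
```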
Assurance Structures
Lastly, our Responsible Scaling, Alignment Stress Testing, and Compliance teams have been focused on exploring possible governance, coordination, and assurance structures. We intend to introduce more independent checks over time and are looking to hire a Risk Manager to develop these structures, drawing on best practices from other industries and relevant research. Some key reflections are:
- The complexity and cross-functional nature of the workstreams described above require a high level of central coordination. We will continue to build a Responsible Scaling Team to manage the complex web of workstreams and dependencies. Amidst a range of competing priorities, strong executive backing has also been essential in reinforcing that identifying and mitigating risks from frontier models is a company priority, deserving significant resources.
- There is value in creating a “second line of defense” – teams that can take a more adversarial approach to our core work streams. Our Alignment Stress Testing team has begun to stress-test our evaluations, interventions, and overall policy execution. For example, the team provided reflections on potential under-elicitation alongside our Claude 3 Opus evaluations report, which were shared with our Board of Directors and summarized in our report to the U.S. Department of Commerce Bureau of Industry and Security. It may make sense to build out a bespoke internal audit function over time.
- In addition to providing regular updates to our Board of Directors and the Long-Term Benefit Trust, we have shared evaluation reports and quarterly updates on progress towards future mitigations with all employees. Encouraging employees to feel ownership over the RSP and to share areas where they would like to see the policy improved has been immensely helpful, with staff drawing on diverse backgrounds to provide valuable insights. We also recently implemented a non-compliance reporting policy that allows employees to anonymously report concerns about our implementation of the RSP to our Responsible Scaling Officer.
Ensuring future generations of frontier models are trained and deployed responsibly will require serious investment from both Anthropic and others across industry and governments. Our Responsible Scaling Policy has been a powerful rallying point, with many teams' objectives over the past months connecting directly back to the major workstreams above. The progress we have made on operationalizing safety during this period has necessitated significant engagement from teams across Anthropic, and there is much more work to be done. Our goal in sharing these reflections ahead of the upcoming AI Seoul Summit is to continue the discussion on creating thoughtful, empirically grounded frameworks for managing risks from frontier models. We are eager to see more companies adopt their own frameworks and share their experiences, leading to the development of shared best practices and informing future efforts by governments.