Supporting benchmarks for AI safety with MLCommons

Standard benchmarks are agreed-upon ways of measuring important product qualities, and they exist in many fields. Some standard benchmarks measure safety: for example, when a car manufacturer touts a “five-star overall safety rating,” they’re citing a benchmark. Standard benchmarks already exist in machine learning (ML) and AI technologies: for instance, the MLCommons Association operates the MLPerf benchmarks that measure the speed of cutting-edge AI hardware such as Google’s TPUs. However, although significant work has been done on AI safety, there are as yet no similar standard benchmarks for AI safety.

We are excited to support a new effort by the non-profit MLCommons Association to develop standard AI safety benchmarks. Developing benchmarks that are effective and trusted is going to require advancing AI safety testing technology and incorporating a broad range of perspectives. The MLCommons effort aims to bring together expert researchers across academia and industry to develop standard benchmarks that measure the safety of AI systems and translate the results into scores that everyone can understand. We encourage the whole community, from AI researchers to policy experts, to join us in contributing to the effort.

Why AI safety benchmarks?

Like most advanced technologies, AI has the potential for tremendous benefits but could also lead to negative outcomes without appropriate care. For example, AI technology can boost human productivity in a wide range of activities (e.g., improving health diagnostics and research into diseases, analyzing energy usage, and more). However, without sufficient precautions, AI could also be used to support harmful or malicious activities, or could respond in biased or offensive ways.

By providing standard measures of safety across categories such as harmful use, out-of-scope responses, and AI-control risks, standard AI safety benchmarks could help society reap the benefits of AI while ensuring that sufficient precautions are being taken to mitigate these risks. Initially, nascent safety benchmarks could help drive AI safety research and inform responsible AI development. With time and maturity, they could help inform users and purchasers of AI systems. Eventually, they could be a valuable tool for policy makers.

In computer hardware, benchmarks (e.g., SPEC, TPC) have shown an amazing ability to align research, engineering, and even marketing across an entire industry in pursuit of progress, and we believe standard AI safety benchmarks could help do the same in this vital area.

What are standard AI safety benchmarks?

Academic and corporate research efforts have experimented with a range of AI safety tests (e.g., RealToxicityPrompts, Stanford HELM’s fairness, bias, and toxicity measurements, and Google’s guardrails for generative AI). However, most of these tests focus on providing a prompt to an AI system and algorithmically scoring the output. This is a useful start, but the results are limited to the scope of the test prompts. Further, these tests usually use open datasets for the prompts and responses, which may already have been (often inadvertently) incorporated into training data.
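
To make the prompt-and-score pattern concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by RealToxicityPrompts, HELM, or any other test named above. The names toy_model, keyword_score, run_prompt_test, and BLOCKLIST are all hypothetical, and the keyword check stands in for the trained classifier a real test would use.

```python
from typing import Callable, Iterable

# Hypothetical blocklist standing in for a real toxicity classifier.
BLOCKLIST = {"insult", "slur"}


def keyword_score(response: str) -> float:
    """Return 1.0 if the response contains a blocklisted word, else 0.0."""
    words = {w.strip(".,!?").lower() for w in response.split()}
    return 1.0 if words & BLOCKLIST else 0.0


def run_prompt_test(
    model: Callable[[str], str],
    prompts: Iterable[str],
    scorer: Callable[[str], float],
) -> float:
    """Send each prompt to the model and return the fraction of flagged responses."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    flagged = sum(scorer(model(p)) for p in prompt_list)
    return flagged / len(prompt_list)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for the AI system under test."""
    return "That is an insult." if "rude" in prompt else "Happy to help."


if __name__ == "__main__":
    rate = run_prompt_test(toy_model, ["be rude", "say hello"], keyword_score)
    print(f"flagged fraction: {rate}")
```

The sketch also shows the limitation described above: the result says nothing about prompts outside the list passed in, and a system that has seen those prompts during training may look safer than it is.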

MLCommons proposes a multi-stakeholder process for selecting tests, grouping them into subsets that measure safety for particular AI use cases, and translating the highly technical results of those tests into scores that everyone can understand. MLCommons is proposing to create a platform that brings these existing tests together in one place and encourages the creation of more rigorous tests that move the state of the art forward. Users will be able to access these tests in two ways: through online testing, where they can generate and review scores, and through offline testing, with an engine for private testing.
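
As a rough illustration of grouping tests by use case and turning technical results into a readable score, consider the following Python sketch. Everything in it is an assumption made for the example: the use-case names, the test names, the letter grades, and the thresholds are hypothetical and do not describe the MLCommons platform or its scoring rules.

```python
from statistics import mean

# Hypothetical grouping of tests into per-use-case subsets.
USE_CASE_TESTS = {
    "chat_assistant": ["toxicity", "stereotyping"],
    "code_assistant": ["harmful_use"],
}

# Hypothetical thresholds on the mean pass rate, checked from strictest down.
GRADE_THRESHOLDS = [(0.99, "A"), (0.95, "B"), (0.90, "C")]


def grade_use_case(use_case: str, pass_rates: dict[str, float]) -> str:
    """Average the pass rates of a use case's tests and map the mean to a grade."""
    tests = USE_CASE_TESTS[use_case]
    average = mean(pass_rates[t] for t in tests)
    for threshold, grade in GRADE_THRESHOLDS:
        if average >= threshold:
            return grade
    return "D"


if __name__ == "__main__":
    # Hypothetical per-test pass rates for one system under test.
    results = {"toxicity": 0.98, "stereotyping": 0.94, "harmful_use": 1.0}
    for name in USE_CASE_TESTS:
        print(name, grade_use_case(name, results))
```

A simple mean is only one possible aggregation. A real benchmark might instead weight tests differently or let the weakest test cap the grade, which is the kind of choice a multi-stakeholder process would need to settle.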

AI safety benchmarks should be a collective effort

Responsible AI developers use a diverse range of safety measures, including automatic testing, manual testing, red teaming (in which human testers attempt to produce adversarial outcomes), software-imposed restrictions, data and model best practices, and auditing. However, determining that sufficient precautions have been taken can be challenging, especially as the community of companies providing AI systems grows and diversifies. Standard AI safety benchmarks could provide a powerful tool for helping the community grow responsibly, both by helping vendors and users measure AI safety and by encouraging an ecosystem of resources and specialist providers focused on improving AI safety.

At the same time, development of mature AI safety benchmarks that are both effective and trusted is not possible without the involvement of the community. This effort will need researchers and engineers to come together and provide innovative yet practical improvements to safety testing technology that make testing both more rigorous and more efficient. Similarly, companies will need to come together and provide test data, engineering support, and financial support. Some aspects of AI safety can be subjective, and building trusted benchmarks supported by a broad consensus will require incorporating multiple perspectives, including those of public advocates, policy makers, academics, engineers, data workers, business leaders, and entrepreneurs.

Google’s support for MLCommons

Grounded in our AI Principles, which were announced in 2018, Google is committed to specific practices for the safe, secure, and trustworthy development and use of AI (see our 2019, 2020, 2021, and 2022 updates). We’ve also made significant progress on key commitments, which will help ensure AI is developed boldly and responsibly, for the benefit of everyone.

Google is supporting the MLCommons Association’s efforts to develop AI safety benchmarks in a number of ways.

Testing platform: We are joining with other companies in providing funding to support the development of a testing platform.

Technical expertise and resources: We are providing technical expertise and resources, such as the Monk Skin Tone Examples Dataset, to help ensure that the benchmarks are well-designed and effective.

Datasets: We are contributing an internal dataset for multilingual representational bias, as well as already externalized tests for stereotyping harms, such as SeeGULL and SPICE. Moreover, we are sharing our datasets that focus on collecting human annotations responsibly and inclusively, like DICES and SRP.

Future direction

We believe that these benchmarks will be very useful for advancing research in AI safety and ensuring that AI systems are developed and deployed in a responsible manner. AI safety is a collective-action problem. Groups like the Frontier Model Forum and Partnership on AI are also leading important standardization initiatives. We’re pleased to have been part of these groups and MLCommons since their beginning. We look forward to additional collective efforts to promote the responsible development of new generative AI tools.


Many thanks to the Google team that contributed to this work: Peter Mattson, Lora Aroyo, Chris Welty, Kathy Meier-Hellstern, Parker Barnes, Tulsee Doshi, Manvinder Singh, Brian Goldman, Nitesh Goyal, Alice Friend, Nicole Delange, Kerry Barker, Madeleine Elish, Shruti Sheth, Dawn Bloxwich, William Isaac, Christina Butterfield.