BetterBench was accepted as a Spotlight at NeurIPS 2024! Come talk to us at the conference in Vancouver!

The problem

Benchmarks are widely used to measure attributes like fairness, safety, or general capabilities, compare model performance, track progress, and identify weaknesses of AI systems. However, the quality of these benchmarks varies significantly depending on their design and usability. Poor-quality benchmarks can lead to misleading comparisons and inaccurate assessments of AI models, potentially resulting in the deployment of suboptimal or even harmful systems in real-world applications.

Figure 1: Scatterplot showing quality differences in benchmarks based on our assessment.

How we're contributing to a solution

To address the issue of varying benchmark quality, we have developed a novel AI benchmark assessment framework that evaluates the quality of AI benchmarks based on 46 criteria derived from expert interviews and domain literature. By applying this framework to 24 widely used AI benchmarks, including both foundation-model and non-foundation-model benchmarks, we have identified statistically significant quality differences within and across both categories. Our findings provide insights into prevalent issues in AI benchmarking and highlight the need for improved benchmark design and usability.
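
To make the assessment concrete, here is a minimal sketch of how criterion-level scores could be aggregated into per-lifecycle-stage quality scores and compared across benchmark categories. The benchmark names, criterion scores, the 0-2 scale, and the use of a Mann-Whitney U test are illustrative assumptions for this sketch, not the exact procedure from the paper.

# Illustrative sketch (assumed scoring scheme, not the paper's exact code):
# aggregate hypothetical criterion scores per lifecycle stage and compare
# two benchmark categories with a non-parametric test.
from statistics import mean
from scipy.stats import mannwhitneyu

# Hypothetical assessments: benchmark -> lifecycle stage -> criterion scores (0-2 scale assumed)
assessments = {
    "benchmark_a": {"design": [2, 1, 2], "implementation": [1, 0, 2]},
    "benchmark_b": {"design": [1, 1, 0], "implementation": [0, 1, 1]},
}

# Per-benchmark, per-stage quality score = mean of that stage's criterion scores
quality = {
    name: {stage: mean(scores) for stage, scores in stages.items()}
    for name, stages in assessments.items()
}
print(quality)

# Compare design-stage scores of two benchmark categories (example values assumed)
foundation_model_scores = [1.8, 1.5, 1.2, 1.6]
non_foundation_model_scores = [0.9, 1.1, 0.7, 1.0]
stat, p_value = mannwhitneyu(foundation_model_scores, non_foundation_model_scores)
print(f"Design-stage comparison: U={stat}, p={p_value:.3f}")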

We have made available a living repository of benchmark assessments, allowing users to analyze the appropriateness of different benchmarks for their specific contexts. Additionally, to support benchmark developers in aligning with best practices, we have created a checklist for minimum quality assurance based on our assessment. This checklist serves as a guide for developers to ensure that their benchmarks meet essential quality standards, promoting more reliable and informative model evaluations.
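
As a rough illustration of how a developer might run a self-check against such a checklist, the sketch below represents a few hypothetical checklist items as booleans and reports which ones are still unmet. The item names are invented for illustration and are not the actual checklist wording.

# Illustrative self-check against a minimum quality checklist.
# The items below are hypothetical placeholders, not the actual
# BetterBench checklist items.
checklist = {
    "evaluation code released": True,
    "test set construction documented": True,
    "known limitations stated": False,
    "maintenance and contact point listed": False,
}

missing = [item for item, satisfied in checklist.items() if not satisfied]
if missing:
    print("Checklist items still missing:")
    for item in missing:
        print(f"  - {item}")
else:
    print("All minimum quality checklist items satisfied.")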

By providing these resources, we aim to foster the development and adoption of high-quality AI benchmarks, ultimately leading to more accurate assessments of AI models and informed decision-making in their deployment.

Figure 2: Average and individual scores of all assessed benchmarks per lifecycle stage.

Where to start

If you’re a benchmark user: Check out our searchable repository of benchmarks and their assessments here.

If you're a benchmark developer: Find our checklist for best practices here and submit your benchmark for assessment here. If you disagree with an existing assessment, you can submit a request for re-assessment here.

How to cite us

As with every research project, a lot of time and passion were put into this initiative. If you found our work useful, we’d appreciate it if you’d cite us. A standard citation can be found below:

@inproceedings{reuel2024betterbench,
  title     = {BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices},
  author    = {Reuel, Anka and Hardy, Amelia and Smith, Chandler and Lamparth, Max and Hardy, Malcolm and Kochenderfer, Mykel J.},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  url       = {https://betterbench.stanford.edu},
  year      = {2024}
}

Team