BetterBench was accepted as a Spotlight at NeurIPS 2024! Come talk to us at the conference in Vancouver!
The problem
Benchmarks are widely used to measure attributes like fairness, safety, or general capabilities, to compare model performance, to track progress, and to identify weaknesses of AI systems. However, the quality of these benchmarks varies significantly depending on their design and usability. Poor-quality benchmarks can lead to misleading comparisons and inaccurate assessments of AI models, potentially resulting in the deployment of suboptimal or even harmful systems in real-world applications.
How we're contributing to a solution
To address the issue of varying benchmark quality, we have developed a novel assessment framework that evaluates the quality of AI benchmarks against 46 criteria derived from expert interviews and the domain literature. By applying this framework to 24 widely used AI benchmarks, spanning both foundation-model and non-foundation-model benchmarks, we identified statistically significant quality differences within and across both categories. Our findings provide insights into prevalent issues in AI benchmarking and highlight the need for improved benchmark design and usability.
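To make the idea of "statistically significant quality differences" concrete, here is a minimal sketch of how aggregate quality scores for two benchmark categories could be compared with a nonparametric test. The scores and the choice of a Mann-Whitney U test are illustrative assumptions, not the exact data or procedure from our paper.

# Minimal sketch: comparing hypothetical aggregate quality scores between two
# benchmark categories. All numbers are made up for illustration and are NOT
# the BetterBench assessment results.
from scipy.stats import mannwhitneyu

# Hypothetical aggregate scores (e.g., fraction of criteria satisfied).
foundation_model_scores = [0.62, 0.71, 0.55, 0.68, 0.74, 0.59]
non_foundation_model_scores = [0.81, 0.77, 0.85, 0.72, 0.79, 0.83]

# Two-sided Mann-Whitney U test: do the two score distributions differ?
statistic, p_value = mannwhitneyu(
    foundation_model_scores,
    non_foundation_model_scores,
    alternative="two-sided",
)
print(f"U = {statistic:.1f}, p = {p_value:.3f}")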
We have made available a living repository of benchmark assessments, allowing users to analyze the appropriateness of different benchmarks for their specific contexts. Additionally, to support benchmark developers in aligning with best practices, we have created a checklist for minimum quality assurance based on our assessment. This checklist serves as a guide for developers to ensure that their benchmarks meet essential quality standards, promoting more reliable and informative model evaluations.
By providing these resources, we aim to foster the development and adoption of high-quality AI benchmarks, ultimately leading to more accurate assessments of AI models and informed decision-making in their deployment.
Where to start
If you’re a benchmark user: Check out our searchable repository of benchmarks and their assessments here.
If you're a benchmark developer: Find our checklist for best practices here and submit your benchmark for assessment here. If you disagree with an existing assessment, you can submit a request for re-assessment here.
How to cite us
As with every research project, a lot of time and passion went into this initiative. If you found our work useful, we'd appreciate a citation. A standard BibTeX entry can be found below:
@inproceedings{reuel2024betterbench,
  title={BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices},
  author={Reuel, Anka and Hardy, Amelia and Smith, Chandler and Lamparth, Max and Hardy, Malcolm and Kochenderfer, Mykel J.},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  url={https://betterbench.stanford.edu},
  year={2024}
}