Welcome to the most comprehensive and up-to-date LLM benchmark comparison platform! We aggregate and predict performance data from leading AI models across a wide range of evaluation benchmarks, offering unique insight into the rapidly evolving landscape of artificial intelligence.
Unlike leaderboards associated with a single benchmark, we let you judge models across a wealth of criteria, from reasoning to coding.
While some leaderboards make you wait a week for results, either because they require a minimum number of online votes or because their evals take days to run, we use the information already available to predict results immediately.
Other platforms exclude models with missing scores, leaving sparse charts. Our score prediction ensures every model appears on every criterion, and our rankings are resilient to response-length bias.
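The length-bias correction isn't detailed on this page; as a minimal sketch of one common approach (not necessarily the one we use, and all names below are illustrative), a linear length effect can be regressed out of the scores, keeping the residuals:

```python
import numpy as np

def debias_length(scores, mean_lengths):
    """Hypothetical correction: regress score on mean response length,
    keep the residuals, and recenter them on the original mean so the
    scale stays familiar. Assumes a simple linear length effect."""
    scores = np.asarray(scores, dtype=float)
    mean_lengths = np.asarray(mean_lengths, dtype=float)
    slope, intercept = np.polyfit(mean_lengths, scores, 1)
    residuals = scores - (slope * mean_lengths + intercept)
    return residuals + scores.mean()
```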
Our data is meticulously collected from multiple sources. Each data point includes a source reference, allowing you to verify the original results. We prioritize transparency and accuracy in all our data collection.
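As an illustration, a single data point could look like this (field names are hypothetical, not the repository's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ScoreRecord:
    model: str       # e.g. "gpt-4o"
    benchmark: str   # e.g. "MMLU"
    score: float     # reported result, on the benchmark's own scale
    source_url: str  # link back to the publication the score came from
```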
Our prediction system uses statistical methods to estimate missing benchmark scores and provide comprehensive model comparisons.
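The exact statistical machinery isn't spelled out here. As a minimal sketch of one standard technique for this problem (not necessarily the one this project uses, and all names below are illustrative), a sparse model × benchmark score matrix can be completed by a low-rank factorization fit only on the observed entries:

```python
import numpy as np

def impute_scores(S, mask, rank=2, iters=50, reg=0.1):
    """Fit a rank-`rank` factorization U @ V.T to the observed entries
    of S by alternating ridge regressions, then predict every entry.

    S    : (n_models, n_benchmarks) score matrix (values ignored where unobserved)
    mask : boolean array of the same shape, True where S is observed
    """
    n_models, n_benchmarks = S.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n_models, rank))
    V = rng.normal(size=(n_benchmarks, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n_models):      # refit each model's latent vector
            obs = mask[i]
            if obs.any():
                U[i] = np.linalg.solve(V[obs].T @ V[obs] + ridge,
                                       V[obs].T @ S[i, obs])
        for j in range(n_benchmarks):  # refit each benchmark's latent vector
            obs = mask[:, j]
            if obs.any():
                V[j] = np.linalg.solve(U[obs].T @ U[obs] + ridge,
                                       U[obs].T @ S[obs, j])
    return U @ V.T                     # dense predictions for every pair
```

Reading off `U @ V.T` yields a predicted score for every (model, benchmark) pair, which is what keeps every model on every chart.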
The algorithm continuously improves as we gather more data and refine our statistical models, providing increasingly accurate predictions over time.
We're on a mission to create the most comprehensive and up-to-date collection of LLM benchmark data, and we need your help!
Have benchmark results you'd like to contribute? Whether they come from official releases, academic papers, or your own testing, we'd love to include them in our database. Your contributions help make this tool more valuable for everyone in the AI community.
Join us in building the definitive resource for LLM performance comparison!
📂 Contribute on GitHub

This project is completely open source and built by the community, for the community. We believe in transparent, collaborative development of AI evaluation tools.
Source Repository: https://github.com/espadrine/metabench
Feel free to explore the code, report issues, suggest improvements, or contribute your own enhancements. Together, we can build better tools for understanding and advancing AI capabilities.
Built with ❤️ by the AI community