Model/Service | Company | Input Price (USD per 1M tokens) | Output Price (USD per 1M tokens) | Context Length | Chatbot Arena ELO | MMLU Score | MT-Bench | GPQA | HumanEval (Coding) | Release Date | Knowledge Cutoff Date |
---|---|---|---|---|---|---|---|---|---|---|---|
For many applications, Chatbot Arena ELO is the best metric to use: models cannot be trained for it in advance, and it is scored from human votes collected by an independent organization. One downside of Chatbot Arena is that human raters tend to favor longer responses, even when a shorter response is equally correct, and the highest-scoring models do tend to produce long responses.
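To make the rating mechanism concrete, here is a minimal sketch of how an Elo-style rating can be updated from pairwise human votes. The K-factor, base rating, and battle log are illustrative assumptions, not Chatbot Arena's actual implementation, which fits ratings over all recorded battles with a more robust statistical procedure rather than a single online pass like this.

```python
# Minimal sketch of Elo updates from pairwise human votes (illustrative only).
from collections import defaultdict

K = 32        # step size: how much a single battle can move a rating (assumed)
BASE = 1000   # rating assigned to a model before any battles (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, outcome):
    """outcome: 1.0 if A wins the human vote, 0.0 if B wins, 0.5 for a tie."""
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + K * (outcome - e_a)
    ratings[model_b] = r_b + K * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, outcome of the human vote)
battles = [("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]
ratings = defaultdict(lambda: BASE)
for a, b, outcome in battles:
    update(ratings, a, b, outcome)
print(dict(ratings))
```

Because each vote only compares two anonymous responses, the rating reflects relative preference across many matchups rather than performance on a fixed question set, which is why it is hard to game by training on a test set.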
MMLU, MT-Bench, HumanEval, and GPQA are evaluated automatically, so they are not biased toward long answers. The risk with these four benchmarks is that the correct answers can leak into training data. This is a particular problem for MMLU, since its questions and answers are widely available online. MT-Bench and HumanEval reduce the risk by using questions written specifically for the benchmark rather than collected from the web, and GPQA mitigates it by not publicly releasing the correct answers. Even so, at least in theory, a model could be fine-tuned specifically to score well on these benchmarks.
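One rough way to gauge this kind of leakage is to scan a training corpus for n-gram overlap with the benchmark questions. The sketch below is an assumed, simplified version of such a check, not the procedure any particular benchmark or lab actually uses; the `benchmark_questions` and `training_docs` inputs are hypothetical.

```python
# Rough word-level n-gram overlap check for benchmark contamination (sketch).
# A real decontamination pipeline would normalize text more carefully and use
# hashing or suffix arrays to scale; this only illustrates the idea.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(benchmark_questions: list[str],
                      training_docs: list[str],
                      n: int = 13) -> list[int]:
    """Indices of benchmark questions sharing an n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, q in enumerate(benchmark_questions)
            if ngrams(q, n) & train_grams]

# Hypothetical example inputs
questions = ["What is the capital of France? It is Paris."]
corpus = ["Trivia dump: what is the capital of france? it is paris."]
print(flag_contaminated(questions, corpus, n=5))  # -> [0]
```

A check like this can flag verbatim leakage, but it cannot rule out paraphrased contamination or deliberate fine-tuning on benchmark-like data, which is why human-judged evaluations remain a useful complement.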