LLMPerf

Updated March 31: Added Command R and Qwen, added mobile UI and FAQ

[Interactive comparison table, filterable by Company. Columns: Model/Service, Company, Input Token Price per Million (USD), Output Token Price per Million (USD), Context Length, Chatbot Arena ELO, MMLU Score, MT-Bench, GPQA, HumanEval (Coding), Release Date, Knowledge Cutoff Date.]

FAQ

What is an LLM?
LLM stands for Large Language Model, which is a type of artificial intelligence that understands and generates human-like text based on the input it receives. LLMs are trained on vast datasets of human language and can perform a wide range of language-related tasks.
Why do you display obsolete models?
We display obsolete models to provide a comprehensive historical context and enable performance comparisons over time. This helps users understand the evolution of LLM technology and its capabilities.
Why don't you display very old models?
Very old models are omitted to maintain the relevance and clarity of our benchmarks. Including them might clutter the data and make it harder for users to find useful information about current and recently obsolete models.
How do you gather pricing information for open models?
For open models, you can download the weights and run them yourself for free -- but most users will not, given the expensive GPUs required. Instead, this site shows prices from together.ai, a low-cost service that hosts open models for you.
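As a rough illustration, a request's cost follows directly from the per-million-token prices shown in the table above. A minimal sketch in Python (the token counts and prices below are made-up placeholders, not quotes for any real model):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Estimate the cost of one request from per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000

# Example with made-up numbers: a 2,000-token prompt and a 500-token answer
# at $0.20 per million tokens for both input and output.
print(f"${request_cost_usd(2_000, 500, 0.20, 0.20):.6f}")  # -> $0.000500
```
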
What is context length?
Context length refers to the maximum number of tokens an LLM can process in a single prompt. A token can be a word, part of a word, or even punctuation, depending on the model's training data and tokenization method. This parameter is crucial as it affects the model's ability to understand and generate coherent and contextually relevant responses.
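To make the relationship between tokens and context length concrete, here is a minimal sketch using the open-source tiktoken tokenizer; the token counts and the 8,192-token context window are illustrative only, since each model uses its own tokenizer and limit.

```python
import tiktoken  # pip install tiktoken; each model family has its own tokenizer

enc = tiktoken.get_encoding("cl100k_base")  # one widely used tokenizer

prompt = "Context length is measured in tokens, not characters or words."
prompt_tokens = enc.encode(prompt)
print(len(prompt), "characters ->", len(prompt_tokens), "tokens")

# The prompt and the requested completion must both fit in the context window.
context_length = 8_192   # illustrative value; see the table for per-model figures
max_completion = 1_024   # tokens reserved for the model's reply
assert len(prompt_tokens) + max_completion <= context_length
```
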
What are the metrics shown?
The metrics displayed on our site include Chatbot Arena ELO, MMLU (Massive Multitask Language Understanding), MT-Bench (Multi-turn Benchmark), GPQA (Graduate-Level Google-Proof Q&A), and HumanEval (code generation). These metrics evaluate the performance of LLMs across different tasks and capabilities.
How often is the data updated?
The data is updated approximately once a week. This frequency ensures that our benchmarks remain up-to-date and reflect the latest developments and improvements in LLM technology.
Why is Grok not included?
We don't yet have any benchmark data for Grok.
Which metric is best for my application?
The best metric for your application depends on the specific tasks you intend to use the LLM for.
  • Chatbot Arena ELO: This is a human-rated score: a person asks two randomly chosen LLMs the same question and picks the better answer, and ratings are updated from these head-to-head votes (a minimal Elo sketch follows this list). The Chatbot Arena is here.
  • MMLU: MMLU is a very large set of exam-style multiple-choice questions, roughly at the level of an undergraduate curriculum. Try it here.
  • MT-Bench: MT-Bench is a benchmark for "multi-turn" chats (conversations involving back-and-forth). A description is here.
  • GPQA: GPQA is an extraordinarily challenging set of multiple-choice questions at the level of an advanced graduate student or scientific expert. Learn more about it here.
  • HumanEval: HumanEval is a code-generation benchmark: each task poses a problem in natural language, the model produces a small piece of code, and that code is run against unit tests to verify functionality (an execution-check sketch follows this list).
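The Elo number comes from a chess-style rating update: after each human vote, the winner's rating rises and the loser's falls in proportion to how surprising the outcome was. A minimal sketch of that update (the K-factor and ratings are illustrative; the Arena's published ratings are computed with a related but more elaborate statistical fit over all votes):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (r_a, r_b) after one head-to-head comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# A 1250-rated model beating a 1200-rated one gains about 14 of a possible
# 32 points, because the win was already somewhat expected.
print(elo_update(1250, 1200, a_won=True))
```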

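The execution-based idea behind HumanEval can be shown in a few lines: a completion only counts as correct if the code it produces actually passes the unit tests. The prompt, completion, and test below are invented for illustration and are not taken from the benchmark itself; real harnesses also sandbox the execution.

```python
# Toy illustration of execution-based scoring (not an actual HumanEval task).
problem_prompt = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

candidate_completion = "    return a + b\n"   # pretend this came from an LLM

test_code = "assert add(2, 3) == 5 and add(-1, 1) == 0"

namespace: dict = {}
try:
    exec(problem_prompt + candidate_completion, namespace)  # define add()
    exec(test_code, namespace)                              # run the tests
    passed = True
except Exception:
    passed = False

print("pass" if passed else "fail")  # counted toward the score only if it passes
```
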
For many applications, Chatbot Arena ELO is the best metric to use, since it cannot be trained for in advance and is evaluated by humans and an independent organization. One downside of Chatbot Arena is that humans typically favor longer responses, even if a shorter response is equally correct. The highest-scoring models on Chatbot Arena do tend to produce long responses.

MMLU, MT-Bench, HumanEval, and GPQA are automatically evaluated, so they are not biased toward long answers. The risk with these four benchmarks is that correct answers might leak into training data. This is a particular problem for MMLU, since its questions and answers are widely available. MT-Bench and HumanEval mitigate this by using automatically generated questions, and GPQA mitigates it by not publicly releasing the correct answers. Even so, at least in theory, a model could be excessively fine-tuned for these benchmarks.