Economics of LLMs: Evaluations vs Pricing

If we look at API token costs for commercialized LLMs and try to map them across various evaluations — such as coding, math, and conversational quality — where does each LLM make sense?
To answer this, I gathered a lot of data on various models and then built a bunch of scatterplots for each evaluation benchmark, both independent and self-reported, and compared them with token pricing.
You’ll see some of the evaluations I will cover in the graphic below.

This is interesting because models excel in different areas.
The Gemini models, for example, perform well in maths, as does DeepSeek R1, whereas GPT-4o excels in conversational quality. That said, Grok-3 seems to have snatched first place in the Chatbot Arena the other day.
OpenAI’s o3-mini-high and now Anthropic’s Claude Sonnet 3.7 (thinking) are leading in quite a few of these, especially code.
The reasoning models may do very well in several benchmarks but have higher latency, which means they take longer to respond.
So the top-scoring model isn’t always the best choice in every case.
Furthermore, we want to weigh the evaluation scores against cost. If something scores 5% better on a benchmark but costs 14 times as much, we might decide the price hike isn’t worth it.
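To make that trade-off concrete, here is a minimal sketch of the back-of-the-envelope check I have in mind. The scores and prices below are hypothetical and only mirror the 5%-better-but-14-times-more-expensive example; they are not taken from any leaderboard or price list.

```python
# Hypothetical numbers only: a 5% better score at 14x the price.
def points_per_dollar(score: float, cost_per_1k_calls: float) -> float:
    """Benchmark points per dollar spent on 1,000 typical API calls."""
    return score / cost_per_1k_calls

expensive_model = points_per_dollar(score=84.0, cost_per_1k_calls=7.00)
cheap_model = points_per_dollar(score=80.0, cost_per_1k_calls=0.50)
print(f"expensive: {expensive_model:.0f} points/$, cheap: {cheap_model:.0f} points/$")
# The cheaper model delivers far more benchmark value per dollar despite the lower score.
```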

I will go through reasoning models briefly, then each subject and its evaluations, and round off with how to stay critical of evaluations in general.
The graphics are, well, graphics. The scatterplots will be available here if you need to look at the data in detail. The data for each model will be available in this sheet.
Reasoning vs Non-Reasoning Models
I need to point out that there is a difference between a reasoning model and a non-reasoning model.
A reasoning model is forced to reason and can backtrack if it goes down an incorrect path, whereas a non-reasoning model cannot.
I wrote about CoT (Chain-of-Thought) a few weeks ago if you are keen to learn more about reasoning strategies that have been talked about in the open source space.
When looking at these metrics, we need to keep in mind that DeepSeek R1, Claude Sonnet 3.7 Thinking, and OpenAI’s models, o1 and o3, are reasoning models. These models will more likely reach a correct answer as they are given more time to “think.”
This doesn’t mean there aren’t drawbacks; reasoning models are naturally slower.

I dragged up a bit of statistics from artificialanalysis.ai to illustrate the differences in latency for various models above.
Google’s Gemini 2.0 Flash may also be a reasoning model, but its latency remains quite low, so I’m unsure how to classify it.
Furthermore, you may be paying for quite a few additional tokens, the “reasoning” tokens, so it’ll be a lot more expensive than a non-reasoning model.

For o3-mini you can decide on the reasoning level (low, medium, or high), but this also determines the number of tokens generated, as demonstrated above. The same goes for Claude Sonnet 3.7 thinking vs non-thinking.
Going forward, keep in mind that the pricing does not include these extra tokens, so if it is a reasoning model the pricing will increase by 30% to 300%, as it will need extra tokens to “think”.
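To illustrate that range, here is a rough sketch of how hidden reasoning tokens inflate the bill. The per-million-token price and the overhead values are assumptions for illustration only, not any provider’s actual rates.

```python
# Illustrative only: hidden "reasoning" tokens are billed as output tokens.
# Overheads of 0.3 and 3.0 correspond to the 30%-300% range mentioned above.
def output_cost(visible_out_tokens: int, price_out_per_million: float, overhead: float) -> float:
    billed_tokens = visible_out_tokens * (1 + overhead)  # visible answer + reasoning tokens
    return billed_tokens / 1e6 * price_out_per_million

for overhead in (0.0, 0.3, 3.0):
    cost = output_cost(visible_out_tokens=200, price_out_per_million=4.00, overhead=overhead)
    print(f"overhead {overhead:>4.0%}: ${cost * 1000:.2f} per 1,000 calls (output side only)")
```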
Knowledge Benchmark (MMLU Pro)
Let’s start this exercise with the MMLU Pro dataset.
MMLU Pro is the more difficult version of MMLU, which measures an LLM’s knowledge across numerous academic subjects. A high MMLU Pro score could indicate usefulness within education, content creation and general Q&A.
You can go read more on it here, but I’ve mapped out several LLMs across the scatterplot below.

We see the MMLU Pro score on the y-axis and the pricing on the x-axis for various LLMs.
Note: I could not find an MMLU Pro score for the o3-mini model, nor did I find one for o1. They will be included in the other evals.
For pricing in this chart, I’ve calculated one API call with 100 tokens in and 200 tokens out, which should represent a typical message (on the longer side). This chart shows the cost of 1,000 of these API calls.
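For reference, this is the simple calculation behind those chart prices. The per-million-token prices in the example are placeholders; the real ones for each model are in the linked sheet.

```python
# Cost of 1,000 API calls at 100 tokens in / 200 tokens out, as used for the chart's pricing axis.
def cost_per_1k_calls(price_in_per_million: float, price_out_per_million: float,
                      tokens_in: int = 100, tokens_out: int = 200, calls: int = 1000) -> float:
    per_call = (tokens_in / 1e6) * price_in_per_million + (tokens_out / 1e6) * price_out_per_million
    return per_call * calls

# Example with made-up prices of $0.10 per million input and $0.40 per million output tokens:
print(f"${cost_per_1k_calls(0.10, 0.40):.3f} per 1,000 calls")
```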
If we disregard the fact that we can’t see the numbers for each specific subject, there are quite a few players now that keep token costs low while scoring fairly high.

In the graph above I’m zeroing in on a few interesting models.
GPT-4o scores a 70 here, while Gemini 2.0 Flash scores a 77 — but will cost you 22 times less. DeepSeek wins here with an 84 while also keeping costs low at $0.5.
MMLU Pro can be interesting, but it’s not enough to look at only one benchmark; we should check out a few independent ones for conversational quality, code, and math.
Champions of Conversational Quality
Now, MMLU Pro rates knowledge across various subjects, but we can also look at more independent benchmarks.
One that is interesting is the Chatbot Arena, where people rate conversations against different LLMs. This is then a sort of crowdsourced AI benchmarking that should tell us how well an LLM handles real-world conversations.
To participate in this one yourself, you can join in to rate the LLMs against each other here.
A high score here should mean the model generates more engaging and coherent responses, i.e. responses that feel more human. This could be ideal for RAG chatbots, as it indicates the model can integrate retrieved information into a natural conversation.
I grabbed a few models to place in the scatterplot below as I did for the MMLU Pro dataset.

I decided to map out Grok-3 even though I don’t have pricing for it yet, because it got such a high score the other day. Gemini Experimental 2.0 also scores high but no pricing yet there either (so not included here).
But let’s break it down for the ones we do have pricing for: GPT-4o takes the lead with a score of 1365, followed by DeepSeek R1 at 1361 and Gemini 2.0 Flash at 1355.
OpenAI’s o1 model sits at 1352 but costs $13.50, making it more expensive. Meanwhile, Claude Haiku and Claude Opus trail behind with 1238 and 1247, respectively.
Gemini 2.0 Flash and GPT-4o could be good choices for a chatbot, as we want to keep the conversation quick and they are both cost-friendly, but we should keep an eye on Grok-3 as well.
Highest Scores for Code Generation
One of the areas that gets the most attention in AI is obviously programming. For this subject we can look at several metrics, such as HumanEval (often self-reported), LiveBench Coding, and Aider’s Polyglot leaderboard.
Aider’s Polyglot leaderboard is also an independent evaluation that asks the LLM to complete the hardest 225 coding questions from Exercism in several programming languages.
This test was updated about two months ago with a much harder code-editing benchmark, since the previous one was saturated.
You can check my scribbles below for this chart.

Aider updated the Polyglot leaderboard just the other day with Anthropic’s newly released Claude Sonnet 3.7 Thinking model, which is now in the lead with 64.9%.
OpenAI’s o1 and o3-mini-high models trail behind at 61.7% and 60.4%. DeepSeek R1 and Claude Sonnet 3.5 also score high, at 56.9% and 51.6%, respectively.
Note: o3-mini has different reasoning levels, which means the pricing will increase with higher reasoning. As you saw in my chart earlier, you can expect that o3-mini-high will have around 40% more out tokens (the “reasoning” behind-the-scenes tokens). This goes for Claude’s Sonnet 3.7 Thinking model as well.
I should also note that Aider recommends combining R1 and Claude Sonnet 3.5, which achieves a higher score than o1 at 14 times less cost.
To compare against this leaderboard, we can also check LiveBench coding benchmarks. LiveBench says that they use fresh test data and objective scoring methods, making it a potentially fair eval.

If you check the scatterplot, we see o3-mini-high and Sonnet 3.7 (thinking) score the highest here as well.
We also see DeepSeek R1 scoring very high while keeping prices quite low, while OpenAI’s o1 does well but with a higher price tag.
Looking at the pricing benchmarks for o3-mini (low/medium), we can probably say that DeepSeek really helped pull down prices by forcing OpenAI’s hand a few weeks ago.
LLMs for Maths
One of the areas where reasoning models really shine is mathematics, so it’s worth checking out.
For this, we can use LiveBench as the evaluation metric as well. As noted above, LiveBench uses fresh test data and says it is continuously updated.

I’ve tested the Gemini models before with several questions from the Putnam dataset and know they’re quite good at mathematics, so I’m not surprised that Gemini 2.0 Flash is high up there alongside o3-mini.
Remember that thinking models usually excel in mathematics, so the fact that Gemini 1.5 Pro (which is not a reasoning model) scores so high tells us it is inherently good at it.
The winner in this area, though, is DeepSeek R1. It’s even outperforming OpenAI’s models, which is interesting. Claude’s new Sonnet 3.7 Thinking model also scores very high here.
MMLU Pro, which we looked at in the beginning, includes math as well, but scores are rarely reported on a per-subject basis, so it’s hard to evaluate. There’s also the MATH eval, but this one is also self-reported.
To do one last thing, let’s look at the LiveBench Data Analysis leaderboard to see how the models perform there.

We don’t see that much change here, except that Claude Sonnet 3.7 takes the lead, followed by o3-mini-high. DeepSeek R1 and Gemini 2.0 Flash still score high.
Now, I need to mention that o3-mini-high and Sonnet 3.7 score the highest on this leaderboard, but just remember that they generate more thinking tokens, so the pricing will also be higher than for the non-thinking models.
For the price, both DeepSeek R1 and Gemini 2.0 Flash look very lucrative.
What to Consider (Benchmarks)
This was a fun exercise, but I wasn’t able to include all models, as I couldn’t find API pricing for some of the new players entering the game. Furthermore, benchmarks can be useful but don’t always paint the entire picture. There are quite a few things we should keep in mind when making comparisons like these.
First, many organizations test their models using CoT (zero-shot or few-shot) strategies, which naturally help them perform better, but this makes it hard to say how they truly compare.
Second, many organizations can accidentally contaminate their models with the benchmark datasets during training, which leads to overfitting on the test data. This is why looking at independent, continuously updated evals is interesting. Third, some of the datasets themselves are rarely reviewed or curated, meaning they could contain mistakes or faulty answers.
Lastly, how a model performs on your use case can vary. You may prefer something with a lower eval score, but if it works, it works. Also remember to distinguish between reasoning and non-reasoning models; if you want to learn more about reasoning strategies, read this article.
Final Notes
Remember, you can look at the real scatterplots here. The data for each model will be available in this sheet.
If you are looking to host your own open source models and are wondering about costs check out an older piece I did here.
To keep yourself updated, remember to regularly check these leaderboards: Chatbot Arena, Polyglot Leaderboard, and MMLU Pro Leaderboard.
Thanks for reading.
❤