Benchmarking endpoints#
In this section, you’ll explore how to navigate our benchmarks and run your own custom ones.
Note
You can learn about the methodology behind our benchmarks, and the various metrics used, in the benchmarks design section.
Quality benchmarks#
To compare the quality of different LLMs, head to the Dashboard page on the console.
By default, all endpoints will be plotted on six datasets, with the OpenHermes dataset shown first. The default router will also be plotted, with various configurations of this router plotted as stars.
On the dataset dropdown at the top, you can select any dataset of prompts to benchmark each model and provider against. The scatter graph will then be replotted for the selected dataset.
Similarly, you can change the metric plotted on the x-axis from cost to something else by clicking on the Metric dropdown. This lets you plot the score against time-to-first-token (TTFT), for example.
You can remove any of these points by simply clicking on the model names on the legend. That model will then be removed from the graph, and the router points will be updated to only account for the remaining endpoints.
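The way the useful endpoints change as you remove models can be thought of as a Pareto frontier: an endpoint only matters if no remaining endpoint is both cheaper and higher-scoring. The sketch below is purely illustrative (the model names and scores are made up, and this is not the console's actual logic):

```python
def pareto_front(endpoints):
    """Return endpoints not dominated on (higher score, lower cost).

    `endpoints` maps name -> (score, cost). An endpoint is dominated if
    another endpoint scores at least as well for at most the same cost,
    with at least one of the two being strictly better.
    """
    front = {}
    for name, (score, cost) in endpoints.items():
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for other, (s, c) in endpoints.items()
            if other != name
        )
        if not dominated:
            front[name] = (score, cost)
    return front

# Hypothetical endpoints: (quality score, cost per million tokens).
models = {
    "model-a": (0.80, 2.0),  # best score, most expensive
    "model-b": (0.78, 0.5),  # slightly worse, much cheaper
    "model-c": (0.60, 1.0),  # dominated by model-b
}
print(sorted(pareto_front(models)))  # ['model-a', 'model-b']
```

Removing an endpoint from the legend and recomputing this frontier mirrors how the router points update to account only for the remaining endpoints.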
Runtime benchmarks#
The benchmarks displayed on the console allow you to compare the average quality and runtime performance of LLM endpoints. As explained in the benchmarks design section, runtime metrics tend to change over time.
For granular runtime benchmarks, head to the benchmarks interface outside of the console. There, you can find a list of popular LLM endpoints, periodically updated with new models and providers.
Each page contains a suite of runtime benchmarks providing timely information on the speed, cost and latency of the endpoints exposed by different endpoint providers.
Note
You can learn more about endpoint providers in the dedicated endpoints section.
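Of the runtime metrics, cost is the most mechanical: it follows directly from the token counts and the provider's per-token prices. A minimal sketch, with hypothetical prices (real per-million-token rates vary by provider and model):

```python
def request_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in USD of one request, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical pricing: $0.25 / M input tokens, $0.50 / M output tokens.
cost = request_cost(input_tokens=1_500, output_tokens=500,
                    price_in_per_m=0.25, price_out_per_m=0.50)
print(f"${cost:.6f}")  # $0.000625
```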
For example, the image below corresponds to the benchmark page for mistral-7b-instruct-v0.2.
The plot displays the changing value of the metric selected on the table for the region and sequence length specified, across time and providers. On the other hand, the table displays the latest values for all metrics across providers, and allows you to sort them by metric.
You can plot a different metric on the graph by clicking on the graph icon next to the metric's column label. For example, the image below shows how the plot for TTFT reveals different performance patterns than the default Output Tokens / Sec figure.
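The two metrics mentioned above measure different things: TTFT is the delay before the first token of a streamed response arrives, while output tokens per second measures sustained generation speed. A rough sketch of how both can be measured from any token stream (here a simulated one, since the actual endpoint calls are out of scope):

```python
import time

def fake_stream(n_tokens=20, first_delay=0.05, inter_delay=0.005):
    """Stand-in for a streaming LLM response (purely simulated)."""
    time.sleep(first_delay)      # delay before the first token arrives
    yield "first"
    for _ in range(n_tokens - 1):
        time.sleep(inter_delay)  # steady inter-token delay
        yield "tok"

def measure(stream):
    """Return (TTFT in seconds, output tokens per second) for a stream."""
    start = time.perf_counter()
    ttft, tokens = None, 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += 1
    total = time.perf_counter() - start
    return ttft, tokens / total

ttft, tps = measure(fake_stream())
```

With the simulated delays above, TTFT is dominated by the first-token wait, while tokens per second reflects the inter-token rate, which is why the two plots can rank providers differently.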
Running your own benchmarks (Beta)#
If you are using custom endpoints or need to compare endpoints for a specific task, you can customize the benchmarks to fit your needs.
Note
If you haven’t done so, we recommend you learn how you can add your own datasets and endpoints to the console before resuming.
Once you’ve added your endpoints and/or datasets, head to the Benchmarks page on the console. There, you can see all of the current and previous benchmark jobs you have triggered, and specify which endpoints you would like to include for benchmarking.
Runtime benchmarks#
If you have several private endpoints deployed across different servers, each with varying latencies, it can be useful to track these speeds over time, to ensure you’re always sending your requests to the fastest servers.
To trigger periodic runtime benchmarking for a custom endpoint, simply add it to the list under Runtime Benchmarks. You also need to specify at least one IP address from which you would like to test this endpoint, and at least one prompt dataset against which to perform the benchmarking.
Once all endpoints are added, you can then go to the benchmarks interface where you’ll find your model listed with a lock icon, indicating that the benchmark is private (only accessible from your account).
You can then open the benchmark page like with any other model, and view the performance for various metrics plotted across time.
Quality benchmarks#
In order to train a router, or just compare the quality of endpoints, it’s necessary to first evaluate the performance of each model on each prompt of a dataset.
In the Quality Benchmarks subsection, you can click on Submit Job to trigger a new benchmark comparing output quality across different LLMs.
You need to specify the endpoints and datasets you would like to benchmark.
Note
All selected endpoints will be tested on all selected datasets. So, if you only want to test some endpoints on some datasets, then you should submit separate jobs.
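The note above can be made concrete: a single job expands to the full cross product of the selected endpoints and datasets. A small sketch with hypothetical names:

```python
import itertools

endpoints = ["gpt-x", "llama-y", "mixtral-z"]    # hypothetical endpoint names
datasets = ["finance-qa", "call-transcripts"]    # hypothetical dataset names

# One job runs every endpoint against every dataset.
pairs = list(itertools.product(endpoints, datasets))
print(len(pairs))  # 6
```

If you only wanted `gpt-x` on `finance-qa` and `llama-y` on `call-transcripts`, a single job would still run all six pairs, which is why separate jobs are needed for partial combinations.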
Once you are happy with the selection, press Submit and the job will appear under Running Jobs, as shown below.
The job can be expanded to see each endpoint and dataset pair and check its progress.
The entire history of benchmarking jobs can also be viewed by clicking on History, like so.
Once the benchmarking is complete, you can visualize the results on the Dashboard page.
Like before, we can select the dataset through the Dataset dropdown. In this case, we’ll plot the benchmarks for the custom dataset we uploaded.
We can see that the custom endpoints mixtral-tuned-finances, llama-3-tuned-calls1 and llama-3-tuned-calls2 we added earlier are all plotted, alongside the foundation router, which is always plotted by default.
Round Up#
That’s it! You now know how to compare LLM endpoints on quality and runtime metrics, and run benchmarks on your own endpoints and datasets. In the next section, we’ll learn how to train and deploy a custom router.