Base Metrics
The base metrics and their permitted syntax when making chat completion queries are as follows:
- Quality (the quality of the generated response): quality, q
- Time to First Token: time-to-first-token, ttft, t
- Inter-Token Latency: inter-token-latency, itl, i
- Cost ($ per million tokens, blended assuming a 3:1 input:output token ratio): cost, c
- Input Cost ($ per million input tokens): input-cost, ic
- Output Cost ($ per million output tokens): output-cost, oc
Meta Providers
Any of these base metrics can be specified in place of a provider when making chat completion requests. For example, let's assume we want to deploy claude-3-opus with whichever provider has the fastest inter-token latency, based on the latest benchmark data. We can specify the following:
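```
claude-3-opus@inter-token-latency
```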
The prefixes highest- and lowest- can also be specified explicitly. When the prefix is omitted, it’s assumed that the user wants the “best” option, which is highest- for quality and lowest- for all other metrics. The example above is therefore equivalent to claude-3-opus@lowest-inter-token-latency.
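For illustration, a string like this can be passed wherever a model@provider identifier is normally accepted. The sketch below assumes an OpenAI-compatible Python client; the base URL and client choice here are assumptions for illustration, not part of the syntax itself:

```python
# Minimal sketch: passing a meta-provider string as the model identifier.
# Assumes an OpenAI-compatible endpoint; the base URL below is an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.unify.ai/v0/",  # assumed base URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    # Meta provider: route to whichever provider currently has
    # the lowest inter-token latency for claude-3-opus.
    model="claude-3-opus@inter-token-latency",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```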
Thresholds
You can specify upper and/or lower bounds for any of these metrics using <, >, <= and >=, separating each specification with | in the command.
For example, the following command will find the llama-3.1-405b-chat provider with the lowest inter-token latency (the fastest), provided that the endpoint is no more expensive than $5 per million tokens.
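Using the itl and cost aliases, one way to write this is:

```
llama-3.1-405b-chat@itl|cost<=5
```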
Similarly, the following command will find the llama-3.1-70b-chat provider with the highest quality (see the explanation below for how this is computed). Differences such as quantization can result in different quality for the same model. The endpoint is also constrained to be no more expensive than $0.8 per million input tokens, no more expensive than $0.6 per million output tokens, and to have an inter-token latency faster than 20ms, but not faster than 1ms.
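One way to express the quality metric together with all four constraints is:

```
llama-3.1-70b-chat@q|input-cost<=0.8|output-cost<=0.6|itl<20|itl>1
```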
Note that either the full metric name or any of its aliases can be used when specifying thresholds, such as i, itl or inter-token-latency for the inter-token latency.
The Search Space
All of the examples above have assumed that all compatible providers are included in the “search space” when making routing decisions. However, you might want to limit the search space. This can be achieved with the providers and skip_providers keywords. Again, these are simply specified within the same chain, separated by |.
The following finds the fastest (lowest itl) endpoint for llama-3.1-405b-chat, among the providers groq, fireworks-ai and together-ai.
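Assuming the provider names are given as a comma-separated list after the keyword, the command looks like this:

```
llama-3.1-405b-chat@itl|providers:groq,fireworks-ai,together-ai
```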
Similarly, the following finds the fastest llama-3.1-405b-chat endpoint among all supported providers, excluding (skipping) azure-ai and aws-bedrock.
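Using the same assumed list syntax:

```
llama-3.1-405b-chat@itl|skip_providers:azure-ai,aws-bedrock
```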