- `quality`, `q`
- `time-to-first-token`, `ttft`, `t`
- `inter-token-latency`, `itl`, `i`
- `cost`, `c`
- `input-cost`, `ic`
- `output-cost`, `oc`
For example, `claude-3-opus@inter-token-latency` will query claude-3-opus with whichever provider has the fastest inter-token-latency, based on the latest benchmark data. We can also specify the prefixes `highest-` and `lowest-`. When the keyword is omitted, it's assumed that the user wants the "best" option, which is `highest-` for quality and `lowest-` for all other metrics. The example above is therefore equivalent to `claude-3-opus@lowest-inter-token-latency`.
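As an illustration of the defaults, the following two specifications should therefore select the same endpoint (assuming the `model@metric` pattern above; the shorter aliases `itl` and `i` could equally be used):

```
claude-3-opus@inter-token-latency
claude-3-opus@lowest-inter-token-latency
```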
Thresholds can also be set on these metrics using `<`, `>`, `<=` and `>=`, separating each specification with `|` in the command.
For example, the following command will find the llama-3.1-405b-chat provider with the lowest inter-token-latency (the fastest), provided that the endpoint is not more expensive than $5 per million tokens:
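A minimal sketch of such a command, assuming the `|`-separated chaining described above and the `c` alias for cost:

```
llama-3.1-405b-chat@itl|c<5
```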
The following finds the llama-3.1-70b-chat provider with the highest quality (see the explanation below for how this is computed). Differences such as quantization can result in different quality for the same model. The endpoint is also constrained to cost no more than $0.8 per million input tokens and no more than $0.6 per million output tokens, and to have an inter-token-latency faster than 20ms, but not faster than 1ms:
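A sketch under the same assumptions, using the `q`, `ic`, `oc` and `i` aliases listed earlier:

```
llama-3.1-70b-chat@q|ic<0.8|oc<0.6|i<20|i>1
```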
Note that either `i`, `itl` or `inter-token-latency` can be used for the inter-token-latency.
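For instance, the chain above could presumably also be written with the longer alias:

```
llama-3.1-70b-chat@q|ic<0.8|oc<0.6|itl<20|itl>1
```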
The search can also be restricted to a specific set of providers, or certain providers can be excluded, using the `providers` and `skip_providers` keywords. Again, this is simply specified within the same chain, separated by `|`.
The following finds the fastest (lowest `itl`) endpoint for llama-3.1-405b-chat, among the providers groq, fireworks-ai and together-ai:
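A sketch of such a command; the comma-separated list after the `providers:` keyword is an assumed syntax, chained with `|` as described above:

```
llama-3.1-405b-chat@itl|providers:groq,fireworks-ai,together-ai
```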
The following finds a llama-3.1-405b-chat endpoint among all supported providers, excluding (skipping) azure-ai and aws-bedrock: