Routing
As well as fallbacks, you can also route requests to directly maximize LLM performance, whether that be improving cost, quality, speed, or a combination of these.
Base Metrics
The base metrics and their permitted syntax when making chat completion queries are as follows:
- Quality (the quality of the generated response): `quality`, `q`
- Time to First Token: `time-to-first-token`, `ttft`, `t`
- Inter-Token Latency: `inter-token-latency`, `itl`, `i`
- Cost ($ per million tokens, assuming a 3:1 ratio of input to output tokens): `cost`, `c`
- Input Cost ($ per million input tokens): `input-cost`, `ic`
- Output Cost ($ per million output tokens): `output-cost`, `oc`
Meta Providers
Any of these base metrics can be specified in place of a provider when making chat completion requests. For example, let's assume we want to deploy `claude-3-opus` with whichever provider has the fastest inter-token latency, based on the latest benchmark data. We can specify the following:
In Python:
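(A minimal sketch, assuming the `unify` Python client with a `Unify` class that accepts an endpoint string and a `generate` method; exact client names may differ in your SDK version. The endpoint string is the part being illustrated.)

```python
from unify import Unify

# Route claude-3-opus to whichever provider currently has the fastest
# (lowest) inter-token latency, per the latest benchmark data.
client = Unify("claude-3-opus@inter-token-latency")

response = client.generate("Explain the trade-off between TTFT and ITL.")
print(response)
```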
The prefix keywords `highest-` and `lowest-` can also be specified. When the keyword is omitted, it's assumed that the user wants the "best" option, which is `highest-` for quality and `lowest-` for all other metrics. The example above is therefore equivalent to `claude-3-opus@lowest-inter-token-latency`.
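For instance, the two endpoint strings below resolve to the same routing behaviour (again a sketch using the assumed `Unify` client):

```python
from unify import Unify

# Equivalent: omitting the prefix defaults to "lowest-" for latency and cost
# metrics (and "highest-" for quality).
implicit = Unify("claude-3-opus@inter-token-latency")
explicit = Unify("claude-3-opus@lowest-inter-token-latency")
```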
Thresholds
You can specify upper and/or lower bounds for any of these metrics using `<`, `>`, `<=` and `>=`, separating each specification with `|` in the command.
For example, the following command will find the `llama-3.1-405b-chat` provider with the lowest inter-token latency (the fastest), provided that the endpoint is not more expensive than $5 per million tokens:
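A sketch of how this might look, assuming the threshold is chained onto the endpoint string with `|` using the metric aliases listed above (the exact string format should be checked against the router reference):

```python
from unify import Unify

# Lowest inter-token latency, constrained to cost below $5 per million tokens.
client = Unify("llama-3.1-405b-chat@itl|c<5")
```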
The following will find the `llama-3.1-70b-chat` provider with the highest quality (see below for an explanation of how this is computed). Differences such as quantization can result in different quality for the same model. The endpoint is also constrained to be no more expensive than $0.8 per million input tokens, no more expensive than $0.6 per million output tokens, and to have an inter-token latency faster than 20ms, but not faster than 1ms:
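A sketch under the same assumptions, chaining all four thresholds with `|`:

```python
from unify import Unify

# Highest quality, subject to:
#   input cost  < $0.8 per million tokens
#   output cost < $0.6 per million tokens
#   1ms < inter-token latency < 20ms
client = Unify("llama-3.1-70b-chat@q|ic<0.8|oc<0.6|itl<20|itl>1")
```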
As explained above, any of the aliases can be used interchangeably, such as `i`, `itl` or `inter-token-latency` for the inter-token latency.
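For example, the three endpoint strings below express the same routing target, differing only in which alias is used (assumed syntax as above):

```python
from unify import Unify

# All three target the provider with the lowest inter-token latency.
a = Unify("llama-3.1-405b-chat@i")
b = Unify("llama-3.1-405b-chat@itl")
c = Unify("llama-3.1-405b-chat@inter-token-latency")
```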
The Search Space
All of the examples above have assumed that all compatible providers are included in the "search space" when making routing decisions. However, you might want to limit the search space. This can be achieved with the `providers` and `skip_providers` keywords. Again, these are simply specified within the same chain, separated by `|`.
The following finds the fastest (lowest `itl`) endpoint for `llama-3.1-405b-chat`, among the providers `groq`, `fireworks-ai` and `together-ai`:
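A sketch under the same assumptions; here `providers` is assumed to take a comma-separated list after a colon:

```python
from unify import Unify

# Restrict the search space to three providers, then pick the fastest of them.
client = Unify(
    "llama-3.1-405b-chat@itl|providers:groq,fireworks-ai,together-ai"
)
```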
The following finds the fastest `llama-3.1-405b-chat` endpoint among all supported providers, excluding (skipping) `azure-ai` and `aws-bedrock`:
Feedback
If you’d like to do more sophisticated routing, such as routing to the best model (not only the provider), then let us know on Discord and we’ll get an interface built for model routing! 🔀