As well as fallbacks, you can also route requests to directly maximize LLM performance, whether that be improving cost, quality, speed, or a combination of these.

Base Metrics

The base metrics, and their permitted syntax when making chat completion queries, are as follows:

  • Quality (The quality of the generated response): quality, q
  • Time to First Token: time-to-first-token, ttft, t
  • Inter Token Latency: inter-token-latency, itl, i
  • Cost ($ per million tokens, assuming a 3:1 input:output token ratio): cost, c
  • Input Cost ($ per million input tokens): input-cost, ic
  • Output Cost ($ per million output tokens): output-cost, oc

Meta Providers

Any of these base metrics can be specified in place of a provider when making chat completion requests. For example, let’s assume we want to deploy claude-3-opus with whichever provider has the fastest inter-token latency, based on the latest benchmark data. We can specify the following:

curl -X 'POST' \
    'https://api.unify.ai/v0/chat/completions' \
    -H "Authorization: Bearer $UNIFY_KEY" \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "claude-3-opus@inter-token-latency",
        "messages": [{"role": "user", "content": "Hello."}]
    }'

In Python:

import unify

# Route claude-3-opus via whichever provider currently has the lowest inter-token latency
client = unify.Unify("claude-3-opus@inter-token-latency")
response = client.generate("Hello.")

The prefixes highest- and lowest- can also be prepended to any metric. When the prefix is omitted, it’s assumed that the user wants the “best” option: highest- for quality and lowest- for all other metrics. The example above is therefore equivalent to claude-3-opus@lowest-inter-token-latency.
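For instance, using the same Python client as above, the following strings are interchangeable (a minimal sketch, assuming the same endpoints are available to your account):

import unify

# lowest- is the default for latency metrics, so these two are equivalent
fastest = unify.Unify("claude-3-opus@inter-token-latency")
also_fastest = unify.Unify("claude-3-opus@lowest-inter-token-latency")

# highest- is the default for quality, so these two are also equivalent
best = unify.Unify("claude-3-opus@quality")
also_best = unify.Unify("claude-3-opus@highest-quality")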

Thresholds

You can specify upper and/or lower bounds for any of these metrics using <, >, <= and >=, separating each specification with | in the command.

For example, the following command will find the llama-3.1-405b-chat provider with the lowest inter-token latency (the fastest), provided that the endpoint is no more expensive than $5 per million tokens:

llama-3.1-405b-chat@inter-token-latency|c<5
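As a sketch, this routing string can be passed to the Python client in exactly the same way as a plain model@provider string:

import unify

# Fastest llama-3.1-405b-chat endpoint costing no more than $5 per million tokens
client = unify.Unify("llama-3.1-405b-chat@inter-token-latency|c<5")
response = client.generate("Hello.")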

The following will find the llama-3.1-70b-chat provider with the highest quality (see the explanation below for how this is computed). Differences such as quantization can result in different quality for the same model. The endpoint is also constrained to cost no more than $0.8 per million input tokens and no more than $0.6 per million output tokens, and to have an inter-token latency faster than 20ms but not faster than 1ms:

llama-3.1-70b-chat@quality|input-cost<=0.8|output-cost<=0.6|itl>1|itl<20

Any of the aliases defined above can be used interchangeably, such as i, itl or inter-token-latency for the inter-token latency.
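For example, the previous command can be written more tersely with the aliases; the sketch below assumes both forms resolve to the same endpoint:

import unify

# Long-form and alias forms of the same constrained routing string
long_form = unify.Unify(
    "llama-3.1-70b-chat@quality|input-cost<=0.8|output-cost<=0.6|itl>1|itl<20"
)
short_form = unify.Unify("llama-3.1-70b-chat@q|ic<=0.8|oc<=0.6|i>1|i<20")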

The Search Space

All of the examples above have assumed that all compatible providers are included in the “search space” when making routing decisions. However, you might want to limit the search space. This can be achieved with the providers and skip_providers keywords. Again, these are simply specified within the same chain, separated by |.

The following finds the fastest (lowest itl) endpoint for llama-3.1-405b-chat among the providers groq, fireworks-ai and together-ai:

llama-3.1-405b-chat@itl|providers:groq,fireworks-ai,together-ai
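Again, the string can be used directly with the Python client (a minimal sketch):

import unify

# Fastest endpoint among groq, fireworks-ai and together-ai only
client = unify.Unify(
    "llama-3.1-405b-chat@itl|providers:groq,fireworks-ai,together-ai"
)
response = client.generate("Hello.")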

The following finds the fastest llama-3.1-405b-chat endpoint among all supported providers, excluding (skipping) azure-ai and aws-bedrock:

llama-3.1-405b-chat@itl|skip_providers:azure-ai,aws-bedrock
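Since the search-space keywords sit in the same |-separated chain as the metrics, they can be combined with thresholds; the sketch below assumes such a combination is accepted:

import unify

# Fastest endpoint excluding azure-ai and aws-bedrock
client = unify.Unify("llama-3.1-405b-chat@itl|skip_providers:azure-ai,aws-bedrock")

# Assumed combination: the same skip list plus a $5 per million token cost ceiling
client = unify.Unify(
    "llama-3.1-405b-chat@itl|skip_providers:azure-ai,aws-bedrock|c<5"
)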

Feedback

If you’d like to do more sophisticated routing, such as routing to the best model (not only the best provider), then let us know on Discord and we’ll get an interface built for model routing! 🔀