Local Models
Several recent tools have made it very easy to deploy SOTA LLMs locally. This can help with security concerns, improve telemetry, and also cut costs substantially.
The General Process
Thankfully, the general process of adding local models to your universal API is very simple, and is largely the same across the libraries you might run local models with:
- Implement your LLM function: This step involves writing your local LLM logic as a function that receives the input arguments supported by the OpenAI Standard (except `model`) and returns an OpenAI-compatible response (refer to this and this).
- Register that function as a local model: The function can then be registered as a local model using `unify.register_local_model`.
- Query the model: The registered model can now be queried using the `unify` Python client by passing the `model` as `<model_name>@local`.
Examples
Ollama
Let’s take the example of a `llama3.2:1b` model that you’re running locally with Ollama and would like to query through the `unify` Python client:
- Implement your LLM function: The custom function could look something like the sketch below.
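Here, LiteLLM is assumed to be the bridge to the local Ollama server; the function name and the `api_base` value are illustrative:

```python
import litellm

def llama_3_2_1b_local(messages, **kwargs):
    # Forward the OpenAI-standard arguments (everything except `model`) to the
    # local Ollama server via LiteLLM; the returned object is OpenAI-compatible.
    return litellm.completion(
        model="ollama/llama3.2:1b",
        messages=messages,
        api_base="http://localhost:11434",  # Ollama's default local endpoint
        **kwargs,
    )
```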
Refer to the LiteLLM docs for more details.
- Register that function as a local model:
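The exact signature of `unify.register_local_model` isn't reproduced here, so treat the call below as a sketch: it assumes the function is registered under a plain model name, with the `@local` suffix only used at query time (check the reference for the actual arguments):

```python
import unify

# Assumed argument order (model name, then the callable); verify against the
# unify.register_local_model reference before relying on it.
unify.register_local_model("llama3.2:1b", llama_3_2_1b_local)
```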
- Query the model: The registered model can now be queried as shown below.
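Assuming the standard `unify` client interface (a `unify.Unify` client plus `generate`) and the model name used at registration above, the query could look like:

```python
import unify

# The registered model is addressed as `<model_name>@local`.
client = unify.Unify("llama3.2:1b@local")
print(client.generate("What is the capital of Spain?"))
```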
HuggingFace
HuggingFace models follow a similar approach to the above when run locally; alternatively, you can just add a custom endpoint for HuggingFace Inference Endpoints.
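For a locally run HuggingFace model, the LLM function might look like the sketch below. It assumes a recent transformers version whose text-generation pipeline accepts chat messages, uses an example checkpoint name, and assembles the OpenAI-compatible response dict by hand:

```python
import time
from transformers import pipeline

# Example checkpoint; swap in whichever model you have downloaded locally.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

def hf_local(messages, **kwargs):
    # Run the chat through the local pipeline...
    outputs = generator(messages, max_new_tokens=kwargs.get("max_tokens", 256))
    reply = outputs[0]["generated_text"][-1]["content"]
    # ...and wrap the reply in an OpenAI-style chat completion dict.
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "llama-3.2-1b-local",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }
        ],
    }
```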
llama.cpp
The process will again be very similar to Ollama and HuggingFace, the only major difference being the use of `llama-cpp-python`.
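A sketch of the corresponding LLM function, assuming a GGUF file on disk (the path below is a placeholder): `llama-cpp-python`'s `create_chat_completion` already returns an OpenAI-style dict, so almost no wrapping is needed:

```python
from llama_cpp import Llama

# Placeholder path to a local GGUF checkpoint.
llm = Llama(model_path="./models/llama-3.2-1b-instruct.Q4_K_M.gguf", n_ctx=4096)

def llama_cpp_local(messages, **kwargs):
    # create_chat_completion returns an OpenAI-compatible chat completion dict;
    # forward only the keyword arguments that llama-cpp-python supports
    # (e.g. temperature, max_tokens).
    return llm.create_chat_completion(messages=messages, **kwargs)
```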