Several recent tools have made it very easy to deploy SOTA LLMs locally. Running models locally can help with security concerns, improve telemetry, and substantially reduce costs.

The General Process

Thankfully, the general process of adding local models to your universal API is straightforward, and is largely the same across the libraries you might use to run local models:

  • Implement your LLM function: This step involves writing your local LLM logic as a function that receives the input arguments supported by the OpenAI standard (except model) and returns an OpenAI-compatible response (refer to the OpenAI chat completions request and response formats). A minimal sketch is shown after this list.
  • Register that function as a local model: The function can then be registered as a local model using unify.register_local_model.
  • Query the model: The registered model can now be queried via the Unify Python client by passing the model as <model_name>@local.
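For illustration, here is a minimal sketch of such a function, assuming a plain dict in the OpenAI chat completion format is accepted; the my_local_llm name, the my-local-model label, and the echo "inference" step are placeholders for your own local model logic:

import time
import uuid

def my_local_llm(**kwargs):
    # kwargs carries the OpenAI-style arguments (messages, temperature, ...),
    # everything except `model`.
    messages = kwargs["messages"]
    # Placeholder "inference": echo the last user message. Replace this with a
    # call into your actual local model.
    reply = "You said: " + messages[-1]["content"]
    # Return a response shaped like an OpenAI chat completion.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "my-local-model",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }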

Examples

Ollama

Let’s say you’re running the llama3.2:1b model locally with Ollama and would like to query it through the Unify Python client:

  • Implement your LLM function: The custom function would look like:
from litellm import completion

def func(**kwargs):
    # Forward the OpenAI-style arguments (messages, temperature, ...) to the
    # locally running Ollama server via LiteLLM.
    return completion(
        model="ollama/llama3.2:1b",
        **kwargs,
    )

Refer to the LiteLLM docs for more details.

  • Register that function as a local model:
from unify import register_local_model
register_local_model("my-llama-3.2-1b", func)
  • Query the model: The registered model can now be queried as follows:
from unify import Unify

client = Unify("my-llama-3.2-1b@local")
print(client.generate("hello world"))

HuggingFace

If you’re running HuggingFace models locally, the approach is similar to the above. If you’re using HuggingFace Inference Endpoints instead, you can simply register a custom endpoint.
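For example, a minimal sketch using a transformers text-generation pipeline might look as follows, assuming a recent transformers version whose chat pipelines accept OpenAI-style messages; the model name, the hf_func name, and the plain-dict response are illustrative assumptions you would adapt to your own setup:

import time
import uuid
from transformers import pipeline
from unify import register_local_model

# Load a chat model locally; the model name here is only an example.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def hf_func(**kwargs):
    # Chat pipelines accept the OpenAI-style messages list directly.
    outputs = generator(
        kwargs["messages"],
        max_new_tokens=kwargs.get("max_tokens") or 256,
    )
    # The pipeline returns the full conversation; the last message is the reply.
    reply = outputs[0]["generated_text"][-1]["content"]
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "qwen2.5-0.5b-instruct",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }
        ],
    }

register_local_model("my-hf-model", hf_func)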

llama.cpp

The process is again very similar to Ollama and HuggingFace, the only major difference being the use of llama-cpp-python to run the model. A sketch is shown below.
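A minimal sketch, assuming llama-cpp-python is installed and a local GGUF file is available, could look like this; the file path and registered model name are placeholders, and create_chat_completion conveniently returns an OpenAI-compatible dict already:

from llama_cpp import Llama
from unify import register_local_model, Unify

# Path to a local GGUF file; this path is only an example.
llm = Llama(model_path="./models/llama-3.2-1b-instruct.Q4_K_M.gguf")

def llama_cpp_func(**kwargs):
    # create_chat_completion returns an OpenAI-compatible dict, so the
    # OpenAI-style kwargs (messages, temperature, ...) can be passed through,
    # provided they are arguments llama-cpp-python supports.
    return llm.create_chat_completion(**kwargs)

register_local_model("my-llama-cpp-model", llama_cpp_func)
client = Unify("my-llama-cpp-model@local")
print(client.generate("hello world"))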