When building LLM apps, the first question is usually: where do we start? Should we just take an off-the-shelf LLM and throw it into production? Probably not, right?

Should we create an evaluation set with thousands of hypothetical failure modes before putting it in front of any users at all? Also probably not.

In general, the following pseudocode captures the best practice for getting your LLM app off the ground 🚀

  while True:
      update unit tests (evals) 🗂️
      while run(tests) failing: 🧪
          vary system prompt, in-context examples, available tools, etc. 🔁
      beta test with users, find more failures from production traffic 🚦
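The inner loop above can be sketched in a few lines of real Python. Everything here (`run_evals`, `fake_llm`, `candidate_prompts`) is a hypothetical placeholder, not a real library: the point is just the shape of "vary the prompt until the evals pass".

```python
def fake_llm(prompt: str, user_input: str) -> str:
    # Stand-in for a real model call (hypothetical): echoes the input
    # only when the system prompt asks it to.
    return user_input if "echo" in prompt else ""

def run_evals(prompt: str, cases: list[tuple[str, str]]) -> list[bool]:
    # Score each (input, expected) case; a real version would call an LLM API.
    return [expected in fake_llm(prompt, inp) for inp, expected in cases]

# Bare-minimum eval set: (input, expected substring) pairs.
cases = [("hello", "hello"), ("42", "42")]

# Candidate system prompts to try, in order.
candidate_prompts = ["You are helpful.", "Please echo the user's message."]

best = None
for prompt in candidate_prompts:        # vary the system prompt...
    if all(run_evals(prompt, cases)):   # ...until run(tests) stops failing
        best = prompt
        break
```

In practice you would vary in-context examples and available tools in the same loop, not just the prompt string.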

So, the first step is to add unit tests. It might feel a bit early to be adding them, but how else are you going to express what you want the LLM to actually do?

In the spirit of test-driven development, adding unit tests as step 1 is a good way to define the bare-minimum requirements we expect our app to handle. From there we can build a bare-minimum solution and start the data flywheel spinning! 🎡
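Concretely, a bare-minimum requirement can be expressed as an ordinary assert-based test. A sketch, assuming a hypothetical `summarize` app function (stubbed here so the example runs without a model):

```python
def summarize(text: str) -> str:
    # In a real app this would call the LLM; the stub just truncates,
    # which is enough to exercise the tests below.
    return text[:50]

def test_summary_is_short():
    out = summarize("word " * 100)
    assert len(out) <= 50          # hard requirement: keep it brief

def test_summary_nonempty():
    assert summarize("hello world") != ""
```

With a test runner like pytest these run as-is, and each production failure you find later becomes one more `test_*` function in the set.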