Happy Friday! This post is part of my Friday AI news and tips series. This week I’ll be sharing some thoughts on testing non-deterministic GenAI-based systems.
Recently, I have been doing a lot of early-stage experiments with new AI-based pipelines at OpenFn. As part of this work, I have been thinking about testing and evaluation. I used to work with classification models, where evaluation was a much more straightforward process: you would create a dataset representing the classes you were trying to predict and calculate your model's precision and recall on its outputs. With more open-ended tasks, however, you have to devise new ways to evaluate a system.
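To make the contrast concrete, here is a minimal sketch of that older classification workflow, assuming scikit-learn, a hypothetical already-trained `model`, and a labelled test set:

```python
# Minimal sketch of the classification-era evaluation loop, using scikit-learn.
# `model`, `X_test`, and `y_test` are hypothetical: a trained classifier and a
# labelled test set representing the classes you are trying to predict.
from sklearn.metrics import precision_score, recall_score

def evaluate_classifier(model, X_test, y_test):
    """Score a classifier the traditional way: fixed labels, exact metrics."""
    y_pred = model.predict(X_test)
    return {
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
    }
```

The key point is that the ground truth is fixed and the metrics are exact, which is precisely what open-ended generation lacks.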
It is now standard practice to use AI to evaluate AI. The underlying idea is often that if you generate enough evaluations, you will approach a human-level evaluation. Designing this testing process is a far more challenging and creative task than before. Important new testing paradigms include LLM-as-Judge (a judge LLM evaluates another LLM's outputs against defined criteria), adversarial LLM-based testing (an LLM tries to trick the system under test with large volumes of varied attacks), and LLM-based continuous evaluation (LLMs track performance in production).
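As a rough illustration of the LLM-as-Judge idea, here is a sketch in Python. The `call_llm` helper, the rubric, and the 1–5 scale are all assumptions for illustration, not a fixed standard:

```python
# LLM-as-Judge sketch. `call_llm` is a hypothetical helper that sends a prompt
# to whatever model provider you use and returns its text response.
import json

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.
Criteria: factual accuracy, relevance to the question, and clarity.
Question: {question}
Answer: {answer}
Respond with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge_answer(call_llm, question: str, answer: str) -> dict:
    """Ask a judge LLM to score another model's answer against fixed criteria."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # in practice, validate and retry on malformed JSON

def evaluate_many(call_llm, pairs):
    """Average judge scores over many question/answer pairs: the 'enough
    evaluations approximate a human' idea from above."""
    scores = [judge_answer(call_llm, q, a)["score"] for q, a in pairs]
    return sum(scores) / len(scores)
```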
One of the most challenging evaluation tasks in GenAI is the evaluation of general-purpose models. All the leading LLM companies still report their results on benchmark datasets, but these are riddled with issues. Several papers in recent years have found evidence that benchmarks used to evaluate leading LLMs have contaminated those models' training data. For example, models may perform significantly better on benchmarks created before their release than on those created afterward, and some models have been found capable of reproducing data from these benchmarks.
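One way to probe this in practice is a simple, and only suggestive, check: compare a model's scores on benchmarks published before versus after its training cutoff. The `score_on_benchmark` helper and the benchmark records below are hypothetical:

```python
# Illustrative contamination smoke test: a large gap between pre- and
# post-cutoff benchmark scores is a hint of contamination, not proof.
from datetime import date

def contamination_gap(score_on_benchmark, benchmarks, training_cutoff: date) -> float:
    """Return mean(pre-cutoff scores) - mean(post-cutoff scores)."""
    pre = [score_on_benchmark(b) for b in benchmarks if b["published"] < training_cutoff]
    post = [score_on_benchmark(b) for b in benchmarks if b["published"] >= training_cutoff]
    if not pre or not post:
        raise ValueError("Need benchmarks on both sides of the cutoff")
    return sum(pre) / len(pre) - sum(post) / len(post)
```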
The unreliability of these benchmarking processes may have eroded overall confidence in testing and evaluation in the field. However, there are some interesting new approaches to address these issues. For example, one paper suggests adding unlikely canary strings to benchmarks to check whether a model can reproduce them, which would indicate that the benchmark leaked into its training data.
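Here is a rough sketch of that canary-string check, again using a hypothetical `call_llm` helper; the canary value itself is made up:

```python
# Canary-string sketch: embed an unlikely marker string in a benchmark, then
# later probe whether a model can complete it. A correct completion suggests
# the benchmark text made it into the training data.
CANARY = "BENCHMARK-CANARY-7f3a9c1e-do-not-train"  # made-up example string

def canary_leaked(call_llm, canary: str = CANARY) -> bool:
    """Prompt the model with the canary's prefix and see if it knows the rest."""
    prefix, suffix = canary[: len(canary) // 2], canary[len(canary) // 2 :]
    completion = call_llm(f"Complete this string exactly: {prefix}")
    return suffix in completion
```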
In production, testing paradigms such as LLM-as-Judge might seem like piling unreliable evaluations on top of unreliable systems. At the same time, they open up the possibility of far more extensive testing than before at affordable cost. For now, I'm embracing these imperfect testing methodologies while trying to stay conscious of the issues they might not capture.