Weekly AI post: Testing AI Systems

Happy Friday! This post is part of my Friday AI news and tips series. This week I’ll be sharing some thoughts on testing non-deterministic GenAI-based systems.

Recently, I have been doing a lot of early-stage experiments with new AI-based pipelines at OpenFn. As part of this work, I have been thinking about testing and evaluation. I used to work with classification models, where evaluation was a much more straightforward process: you would create a dataset representing the classes you are trying to predict and calculate your model's precision and recall from its outputs. With more open-ended tasks, however, you have to think of new ways to build an evaluation process.
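
For anyone who hasn't worked with that older setup, here's a minimal sketch of what classification evaluation looks like, using scikit-learn and made-up labels (none of this reflects real OpenFn data):

```python
# Minimal classification evaluation: compare predictions against a labelled
# dataset and compute precision and recall. Labels are invented for illustration.
from sklearn.metrics import precision_score, recall_score

y_true = ["spam", "ham", "spam", "spam", "ham", "ham"]   # ground-truth labels
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham"]   # model outputs

precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```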

It is now a standard approach to use AI for evaluating AI. The underlying idea is often that if you generate enough automated evaluations, you approach a human-level evaluation. Designing this testing process is a far more challenging and creative task than before. Some important new testing paradigms include LLM-as-Judge (a judge LLM evaluates another LLM's outputs against defined criteria), adversarial LLM-based testing (using an LLM to try to trick another LLM with large volumes of varied attacks), and LLM-based continuous evaluation (using LLMs to track performance in production).
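
To make the first of these concrete, here is a minimal LLM-as-Judge sketch. It assumes an OpenAI-style chat API; the judge model, rubric, and 1–5 scale are placeholders rather than anything we actually use at OpenFn:

```python
# Minimal LLM-as-Judge sketch: one LLM grades another LLM's output.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the output of another AI system.
Criteria: factual accuracy and relevance to the user's request.
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def judge(user_request: str, model_output: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"Request:\n{user_request}\n\nOutput:\n{model_output}"},
        ],
        temperature=0,  # reduces (but does not eliminate) variance in the judge
    )
    # Assumes the judge follows the instruction to reply with a bare integer.
    return int(response.choices[0].message.content.strip())
```

In practice you would likely run the judge several times per output and aggregate the scores, since the judge itself is non-deterministic.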

One of the most challenging evaluation tasks in GenAI is the evaluation of general-purpose models. All the leading LLM companies still measure their results using benchmark datasets, but these are riddled with issues. Several papers in recent years have found evidence that the benchmarks used to evaluate leading LLMs have contaminated those models' training data. For example, models can perform significantly better on benchmarks created before their release than on those created afterward, and some models have been found capable of reproducing data from these benchmarks.

The unreliability of these benchmarking processes may have eroded overall confidence in testing and evaluation in the field. However, there are some interesting new approaches to address these issues. For example, one paper suggests adding unlikely canary strings to benchmarks to check if a model can reproduce them.
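
As a rough illustration of how a canary check might work (the canary value and model name below are invented, and the call assumes an OpenAI-style client):

```python
# Canary-string contamination check: a benchmark embeds a unique, unguessable
# string in every document it publishes; if a model can complete it, the
# benchmark has very likely leaked into that model's training data.
from openai import OpenAI

client = OpenAI()

CANARY_PREFIX = "EVAL-CANARY-8f3a1c2e-"   # invented for illustration
CANARY_SUFFIX = "b7d94e6a5c10"            # the unguessable part

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model under test
    messages=[{"role": "user",
               "content": f"Continue this string exactly: {CANARY_PREFIX}"}],
    temperature=0,
)

completion = response.choices[0].message.content or ""
print("possible contamination" if CANARY_SUFFIX in completion
      else "no evidence of contamination")
```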

In production, testing paradigms such as LLM-as-Judge might seem like piling unreliable evaluations on top of unreliable systems. At the same time, they open up the possibility of far more extensive testing than before, at affordable cost. For now, I'm embracing these imperfect testing methodologies while trying to remain conscious of the issues they might not capture.


Great post! And it’s been great to work with you on some of this stuff recently - I’m looking forward to digging into this in the new year.

Speaking as an enthusiastic AI skeptic, I've always been very concerned about pipelines that have been trained/validated/tested with AI, especially because of AI's tendency to sound convincing whether or not it's right.

But then again, it worked for AlphaZero, so who am I to judge?

As with all AI, I think maintaining a degree of scepticism is productive and healthy. If you’re using AI in evaluation, and you’re rigorously testing the judgements of that evaluation, then you can probably have a high degree of confidence in the performance of your model. That’s a good place to be, and that’s a useful tool in testing.

As I'm learning, using AI to scale tests and evaluations can be incredibly useful. And, properly deployed, it's probably more robust than human testing at that sort of scale (I can't tell you how many subtly broken tests I've written over the years. No, really, I can't!)

One thing to watch out for, especially if you're using an external model or LLM, is that version numbers aren't changing beneath you. Models are already non-deterministic, and if you pair that with a service that can change overnight, you really can't trust the AI any more than a random number generator. Sure, it worked great today. But tomorrow?
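
One cheap mitigation is to pin evaluations to a dated model snapshot rather than a floating alias, so any change is one you made deliberately. A rough sketch, assuming an OpenAI-style API (the snapshot name is just an example):

```python
# Pin the evaluator to a dated snapshot so the provider's "latest" alias
# can't silently change the behaviour of your test suite.
from openai import OpenAI

client = OpenAI()

# JUDGE_MODEL = "gpt-4o"            # floating alias: can be remapped overnight
JUDGE_MODEL = "gpt-4o-2024-08-06"   # pinned snapshot: changes only when you upgrade

response = client.chat.completions.create(
    model=JUDGE_MODEL,
    messages=[{"role": "user", "content": "Score this output from 1 to 5: ..."}],
    temperature=0,  # reduces variance, but doesn't make the model deterministic
)
print(response.choices[0].message.content)
```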

And that's something I particularly worry about: setting up a fantastic, bullet-proof AI evaluation suite which you can take to the bank. And then - for whatever reason - in three months' time, when you're looking elsewhere, it starts subtly failing.

Or even worse, maybe it takes 5 years to fail, the engineers have all moved on, and no-one notices the problem until serious harm has been done.

But that's true of all good engineering. Challenge assumptions, test rigorously and regularly, strive for objectivity, and actively try to break your thing. You've gotta see your tests fail before they pass - otherwise how do you know you're testing the right thing?

The canary thing is great, it's kicked me off on a little rabbit hole. Here's a nice post about it for any curious readers who, like me, don't have the concentration to dig into academic papers: The Problem with Benchmark Contamination in AI