Testing the Untestable: Strategies for Non-Deterministic Systems

Written by Herman Lintvelt

Unpredictable AI

Originally posted on Substack

Your test suite assumes predictable outputs. Your AI doesn’t. Now what?

In the first article of this series, I argued that engineering fundamentals matter more in the AI era. In the second, we explored how to write specifications that account for uncertainty. Today, let’s tackle what might be the most uncomfortable question: how do you test something that behaves differently every time?

If you’ve been writing software for a while, you know the comfort of a green test suite. You write a function. You test it with known inputs. You assert on expected outputs. The test either passes or fails. Deterministic. Reliable. Sane.

Then you add an LLM to your product.

Suddenly, your test that checks “categorise this transaction as ‘groceries’” fails randomly. Or passes when it should fail. Or the AI gives you a perfectly reasonable answer that’s technically correct but not what your test expects. Your carefully crafted assertions become a game of whack-a-mole.

Welcome to testing non-deterministic systems. It’s uncomfortable and it requires new thinking, but it’s absolutely essential. Because shipping AI features without testing is like driving with your eyes closed: you might get somewhere, but probably not where you intended to go, and quite possibly stuck in an unforeseen river crossing.

Why Traditional Testing Breaks

Let me show you where traditional testing falls apart. Here’s a test you might write for a deterministic system:

def test_expense_categorization():
    expense = "Starbucks Coffee"
    category = categorize_expense(expense)
    assert category == "Food & Dining"

Simple. Clear. And completely inadequate for AI systems.

Run this test against an LLM-based categoriser and you’ll get:

  • “Food & Dining” (great!)
  • “Coffee Shops” (reasonable)
  • “Restaurants & Cafes” (also fine)
  • “Personal Care” (wait, what?)
  • “Entertainment” (now we have a problem)

Same input. Different outputs. And some of them are actually defensible: Is Starbucks food or entertainment? Depends on context.

Traditional unit tests assume a single correct answer exists. AI systems often have multiple valid responses and occasionally incorrect ones. This isn’t a testing problem to solve; it’s a reality to design for.

The Three Testing Layers

Through building several AI-powered products, I’ve settled on a three-layer testing approach that provides meaningful confidence without requiring perfect determinism:

1. Unit Tests (Traditional + Adapted)

Test the deterministic parts and the AI boundaries.

2. Integration Tests (Behaviour Validation)

Test that AI components interact correctly within your system.

3. Evals (Quality Assessment)

Test that AI outputs meet quality thresholds at scale.

Let’s dive into each layer.

Layer 1: Unit Tests for AI Systems

You can still write useful unit tests for AI-powered features. You just need to test different things.

Test the Deterministic Parts

Most AI features have significant non-AI code around them:

  • Input validation and sanitisation
  • Prompt construction
  • Response parsing
  • Error handling
  • Caching logic
  • Rate limiting

All of this should be tested traditionally:

def test_prompt_construction():
    user_input = "What's the weather?"
    context = {"location": "San Francisco", "user": "Alice"}

    prompt = build_prompt(user_input, context)

    assert "San Francisco" in prompt
    assert "Alice" in prompt
    assert len(prompt) < MAX_PROMPT_LENGTH

These tests are fast, deterministic, and catch regressions in your scaffolding code.
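
One scaffolding piece from the list above that deserves its own test is response parsing, because models occasionally return malformed output. Here’s a hedged sketch, assuming a hypothetical parse_ai_response helper that extracts a category from raw model text and falls back safely on bad output:

def test_response_parsing_handles_malformed_json():
    # parse_ai_response is an illustrative helper name, not a given API
    malformed_outputs = [
        'Sure! The category is: Groceries',   # prose instead of JSON
        '{"category": "Groceries"',           # truncated JSON
        '',                                   # empty response
    ]

    for raw in malformed_outputs:
        result = parse_ai_response(raw)
        # Bad model output should never crash parsing; it should fall back
        assert result.used_fallback or result.category in VALID_CATEGORIES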

Test the AI Boundaries

Even if you can’t predict exact AI outputs, you can test the contract:

def test_categorisation_returns_valid_category():
    result = categorise_expense("Starbucks Coffee")

    assert result in VALID_CATEGORIES
    assert isinstance(result, str)
    assert len(result) > 0

This doesn’t care if the AI chose “Food & Dining” or “Coffee Shops”; it just ensures the response is structurally valid and within expected bounds.

Test Confidence Thresholds

If your system uses confidence scores (it should), test the logic that acts on them:

def test_low_confidence_triggers_fallback():
    # Mock AI with low confidence
    with mock_ai_response(category="Groceries", confidence=0.45):
        result = categorise_with_fallback("Ambiguous Store")

    assert result.used_fallback
    assert result.category in DEFAULT_CATEGORIES

The AI is mocked here. You’re testing your handling logic, not the AI itself.
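
How you mock the AI depends on your stack; here’s a minimal sketch of a mock_ai_response helper, assuming your categoriser goes through a single client call you can patch (the "myapp.ai_client.complete" target is illustrative):

from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def mock_ai_response(category, confidence):
    # Patch the low-level AI client call so the categoriser sees a
    # canned response instead of hitting the real API
    fake = {"category": category, "confidence": confidence}
    with patch("myapp.ai_client.complete", return_value=fake):  # illustrative target
        yield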

Test Error Handling

AI services fail. Rate limits hit. Networks drop. Test your resilience:

def test_ai_timeout_handling():
    with mock_ai_timeout():
        result = categorize_expense("Coffee")

    assert not result.success
    assert result.error_type == "timeout"
    assert result.fallback_category is not None

On Tracto, these error-handling tests caught critical issues before production. When the hosted AI model we use went down, our app gracefully degraded instead of crashing.
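
The production pattern behind that graceful degradation is worth spelling out. A minimal sketch, with illustrative names (CategorisationResult, the error types and rule_based_category are stand-ins, not Tracto’s actual code):

def categorise_with_fallback(description):
    # Wrap the AI call so failures degrade to a rule-based default
    # instead of propagating an exception to the user
    try:
        category = categorise_expense(description)
        return CategorisationResult(category=category, used_fallback=False)
    except (AITimeoutError, AIServiceUnavailableError) as err:
        log_ai_failure(description, err)
        return CategorisationResult(
            category=rule_based_category(description),
            used_fallback=True,
        )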

Layer 2: Integration Tests

Integration tests verify that your AI components play nicely with the rest of your system. These tests accept non-determinism but validate behaviour.

Test the Full Pipeline

def test_expense_categorisation_end_to_end():
    # Real expense data
    expense = create_test_expense(
        description="Whole Foods Market",
        amount=45.23
    )

    # Run through real AI categorisation
    categorised = process_expense(expense)

    # Validate behaviour, not exact output
    assert categorised.category in FOOD_RELATED_CATEGORIES
    assert categorised.confidence >= 0.7
    assert categorised.processing_time < 2.0  # seconds
    assert categorised.metadata.get("ai_model") is not None

This test runs against your actual AI integration (maybe using a test API key). It doesn’t demand a specific category, but it validates reasonable behaviour.
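
Because these tests hit a real endpoint, it helps to gate them so they only run when credentials are available. A minimal sketch, assuming pytest and an environment variable name of your own choosing:

import os
import pytest

# Skip this whole test module unless a test API key is configured.
# TEST_AI_API_KEY is an illustrative name, not a standard variable.
pytestmark = pytest.mark.skipif(
    not os.environ.get("TEST_AI_API_KEY"),
    reason="integration tests require a test AI API key",
)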

In Writing User Stories for Uncertain Systems, we looked at how to write specifications that account for uncertainty; here we’re turning those ideas into executable tests.

Test Learning and Adaptation

If your AI learns from user feedback:

def test_categorisation_improves_with_feedback():
    expense = "Joe's Corner Shop"

    # First categorisation (baseline)
    initial = categorise_expense(expense)

    # User corrects it
    provide_feedback(expense, correct_category="Groceries")

    # Subsequent categorisation
    second = categorise_expense(expense)

    # Should show learning
    assert second.confidence >= initial.confidence

    # For similar expenses in same session
    similar = categorise_expense("Joe's Corner Shop purchase")
    assert similar.category == "Groceries"

This validates that the learning mechanism works without requiring perfect prediction.

Test Multi-Step AI Interactions

Many AI features involve chains of calls:

def test_health_recommendation_pipeline():
    user = create_test_user(diagnosis="ADHD", age=8)

    # AI should: understand context -> search content -> rank -> personalise
    recommendations = get_video_recommendations(user, limit=5)

    # Validate the pipeline
    assert len(recommendations) == 5
    assert all(rec.relevance_score > 0.6 for rec in recommendations)
    assert all("ADHD" in rec.tags for rec in recommendations)
    assert recommendations[0].relevance_score >= recommendations[-1].relevance_score  # Ranked
    assert any(rec.personalisation_signals for rec in recommendations)

You’re not testing what specific videos were chosen, but that each pipeline stage contributed appropriately.

Layer 3: Evals (The Game Changer)

Evals (evaluations) are how you test AI quality at scale. This is where you accept non-determinism but measure whether the AI meets your specified thresholds.

Think of evals as QA for AI: instead of testing for exact correctness, you test for acceptable quality across a representative sample.

Building an Eval Set

Start by creating a test dataset of real-world scenarios with multiple valid answers:

EVAL_SET = [
    {
        "input": "Starbucks Coffee",
        "acceptable_categories": ["Food & Dining", "Coffee Shops", "Restaurants"],
        "unacceptable_categories": ["Transportation", "Healthcare"]
    },
    {
        "input": "Uber to airport",
        "acceptable_categories": ["Transportation", "Travel"],
        "unacceptable_categories": ["Food & Dining", "Entertainment"]
    },
    # ... 100+ examples covering edge cases
]

Notice we specify acceptable ranges, not single answers. This acknowledges that multiple responses could be correct.

Running Evals

An eval runs your AI against this dataset and measures aggregate quality:

def run_categorisation_eval():
    results = {
        "acceptable": 0,
        "unacceptable": 0,
        "total": len(EVAL_SET)
    }

    for test_case in EVAL_SET:
        category = categorise_expense(test_case["input"])

        if category in test_case["acceptable_categories"]:
            results["acceptable"] += 1
        elif category in test_case["unacceptable_categories"]:
            results["unacceptable"] += 1
            log_failure(test_case, category)

    accuracy = results["acceptable"] / results["total"]

    assert accuracy >= 0.85, f"Accuracy {accuracy} below threshold"
    assert results["unacceptable"] == 0, "Found unacceptable categorisations"

    return results

This eval passes if 85%+ of categorisations are acceptable and 0% are explicitly wrong. That’s your spec from the user story turned into an automated test.

There are numerous libraries, frameworks, and tools that help you develop, manage, and run evals for agentic systems, and even generate them from real-world data or traces of your AI agents. I’ll go into more detail and show some of these in a future article.

Eval Strategies for Different AI Tasks

Classification Tasks (like expense categorisation):

  • Measure accuracy against acceptable categories
  • Track confidence distribution (see the sketch after these lists)
  • Monitor edge case handling

Generation Tasks (like content recommendations):

  • Use LLM-as-judge to rate outputs
  • Measure diversity and relevance
  • Track user engagement as ground truth

Conversational AI:

  • Evaluate on multi-turn consistency
  • Check safety and appropriateness
  • Measure goal completion rates
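
As a concrete example for classification tasks, here’s a hedged sketch of tracking the confidence distribution across the eval set (it assumes an illustrative categorise_expense_with_confidence helper that returns a (category, confidence) pair):

from statistics import median, quantiles

def eval_confidence_distribution():
    confidences = [
        categorise_expense_with_confidence(case["input"])[1]
        for case in EVAL_SET
    ]

    # The model should be confident on most of the eval set...
    assert median(confidences) >= 0.8
    # ...and the weakest decile shouldn't be outright guessing
    assert quantiles(confidences, n=10)[0] >= 0.5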

Here’s an LLM-as-judge eval concept for video recommendations:

def eval_recommendation_quality(user_context, recommendations):
    """Use GPT-4 to judge recommendation relevance"""

    prompt = f"""
    Given this user context:
    {user_context}

    Rate these video recommendations on relevance (0-10):
    {recommendations}

    Respond with JSON: {{"scores": [score1, score2, ...], "reasoning": "..."}}
    """

    judgment = call_gpt4(prompt)
    avg_score = mean(judgment["scores"])

    return {
        "average_relevance": avg_score,
        "passing": avg_score >= 7.0,
        "reasoning": judgment["reasoning"]
    }

This lets you evaluate subjective quality at scale without manual review.
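
The call_gpt4 helper above is deliberately left abstract. A minimal sketch of one way to implement it, assuming the OpenAI Python SDK with JSON-mode output (model choice and error handling omitted for brevity):

import json
from statistics import mean  # used by eval_recommendation_quality above
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_gpt4(prompt):
    # Request a JSON object so the judge's scores are machine-readable
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any JSON-mode-capable model works
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)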

The Real-World Pattern

Let me show you how these layers work together with a real example from Tracto.

We had an AI feature that personalised parent training video recommendations. Here’s how we tested it:

Unit Tests (Fast, run on every commit):

  • Prompt construction includes all required context
  • Response parsing handles malformed JSON
  • Confidence thresholds trigger correct fallback logic
  • Cache invalidation works correctly
  • Rate limiting prevents API overuse

Integration Tests (Slower, run before merge):

  • End-to-end recommendation flow completes successfully
  • Videos returned are from the active content library
  • Recommendations improve after the user provides feedback
  • System degrades gracefully when the AI service is down
  • Telemetry captures all expected metrics

Evals (Slowest, run nightly + on major changes):

  • 85%+ of recommendations rated relevant by LLM-as-judge
  • Recommendations improve with interaction history
  • No recommendations of archived/inappropriate content
  • Performance meets <1500ms latency requirement
  • Diversity: no user receives the same recommendation twice

This layered approach gave us confidence to ship. The unit tests caught bugs fast. The integration tests verified system behaviour. The evals ensured quality at scale.

When we changed our prompt strategy, the unit tests passed (structure was fine), integration tests passed (pipeline worked), but evals caught that recommendation diversity tanked. We fixed it before shipping.

The Uncomfortable Truth

Let me be honest: testing AI systems is less satisfying than testing traditional code.

You won’t get that dopamine hit of a perfectly green test suite with 100% passing tests. You’ll get “85% accuracy” and “confidence threshold met” and “no catastrophic failures detected.”

It feels fuzzy. It feels uncertain. Because it is.

But that’s not a reason to skip testing; it’s a reason to test more thoughtfully. Your tests won’t guarantee perfection, but they’ll prevent disasters and catch regressions.

The companies shipping reliable AI features aren’t the ones with perfect models. They’re the ones with robust testing that catches when things go wrong.

Start Here

If you’re building AI features without adapted testing practices, start small:

  1. Add one eval: Pick your most critical AI feature. Create a 20-item eval set. Run it manually.
  2. Test boundaries: Add unit tests for your AI scaffolding code—the parts that should be deterministic.
  3. Monitor one metric: In production, track one quality metric continuously. Just one.
  4. Set one threshold: Define what “acceptable” means quantitatively, then test against it.

These four changes will catch 80% of AI-related issues. You can add sophistication later.

The goal isn’t perfect testing; it’s confident shipping: tests that help you sleep at night, knowing your AI feature probably won’t embarrass you in production.

In the AI era, testing is still about confidence. We’ve just had to redefine what confidence means when dealing with systems that surprise us.

Because the code might be non-deterministic, but our commitment to quality should be rock solid.

Additional Resources: