CI/CD/CE: The Third Pillar of AI Development

Written by Herman Lintvelt


Your deployment pipeline is green. Your model is live. And somewhere in production, it’s quietly failing.

In the first article of this series, I introduced the concept of CI/CD/CE: Continuous Integration, Continuous Deployment, and Continuous Evaluation. We’ve covered how to specify and test AI systems. Now let’s talk about the part that keeps them working in production.

If you’ve shipped software before, you know the comfort of a good CI/CD pipeline. Code gets committed, tests run automatically, and builds deploy seamlessly. When something breaks, you know immediately. When you need to roll back, it’s a single command. It’s a safety net and a comfort.

Then you add AI to your product.

Suddenly, your green CI pipeline doesn’t mean your AI feature is working. Your deployment succeeded, but your recommendation engine is suggesting nonsense. Your tests all passed, but users are complaining that the chatbot has gotten worse. Everything deployed fine, but somehow, performance is degrading.

Welcome to the world where deployment success doesn’t equal production success. Where your model can drift silently. Where yesterday’s acceptable performance becomes today’s user complaint.

This is why traditional CI/CD isn’t enough for AI systems. You need that third pillar: Continuous Evaluation.

Why CI/CD Alone Falls Short

Traditional CI/CD works when code behaves predictably: a given input produces the expected output. Tests pass, deployment succeeds, monitoring shows green. You’re done.

AI breaks this model. The failure modes are different:

Silent degradation: Models don’t crash; they get worse. Accuracy drops from 85% to 70%. Users notice. Your pipeline doesn’t.

Data drift: User behavior shifts. Training data becomes stale. The model still runs, but relevance declines.

Adversarial inputs: Users game your system. Content moderation gets fooled by variations. Agents get jailbroken with poems. No error logs. Just failures.

Context shifts: Seasonality, trends, world events; all invalidate your model’s assumptions.

Model updates: Third-party LLMs get updated. Same API, different behaviour. Tests pass (API works), users complain (quality changed).

Traditional monitoring catches breaks. AI often doesn’t break—it just stops being useful.

The Continuous Evaluation Pipeline

The traditional pipeline validates code correctness. The AI pipeline adds quality validation at every stage:

  1. Code commit
  2. Run unit tests (deterministic parts)
  3. Run integration tests (system behaviour)
  4. Run evals (AI quality benchmarks)
  5. Build artefact
  6. Deploy to staging
  7. Run production-like evals on staging
  8. Deploy to production (canary/blue-green)
  9. Monitor traditional metrics (uptime, latency, errors)
  10. Monitor AI metrics (accuracy, confidence, drift)
  11. Run continuous evals on production data
  12. Alert on degradation
  13. Auto-rollback on critical failures

Steps 4, 7, and 10 through 13 are new. They validate AI quality, not just code correctness. Let me break down what each layer involves.

Layer 1: Pre-Deployment Evals

Remember evals from the testing article? They’re not just for development; they’re gatekeepers in your deployment pipeline.

Running Evals in CI

Pre-deployment evals run before code reaches production. They validate that your AI meets minimum quality thresholds: accuracy, failure rates, latency, and confidence levels. If evals fail, deployment is blocked. No exceptions.

Think of it as running pytest, but for AI quality instead of code correctness. You define thresholds (“accuracy must be above 75%”, “zero catastrophic failures”), run your eval suite against a golden dataset, and compare results to your baseline from the main branch.

For this, DeepEval provides a solid framework. It integrates with pytest, gives you 30+ pre-built metrics for LLM evaluation, and supports both end-to-end and component-level testing. You can run evals locally during development, in CI before deployment, and even set up regression checks against your baseline performance.
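To make that concrete, here’s a minimal sketch of what such a gate might look like with DeepEval and pytest. The golden_dataset.json file, its schema, and the my_app module are illustrative assumptions, and the 0.75 threshold is an example; note that DeepEval’s built-in metrics typically use an LLM judge, so the CI job needs model credentials configured.

```python
# Minimal sketch of a DeepEval quality gate run in CI via pytest.
# golden_dataset.json, its schema, and my_app.answer are illustrative
# assumptions; the 0.75 threshold is an example, not a recommendation.
import json

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

import my_app  # placeholder for your own application code

with open("golden_dataset.json") as f:
    GOLDEN_CASES = json.load(f)  # e.g. [{"input": ..., "expected": ...}, ...]

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_meets_quality_threshold(case):
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=my_app.answer(case["input"]),
        expected_output=case["expected"],
    )
    # Fails the build (blocking deployment) if relevancy drops below 0.75.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.75)])
```

You can run this with plain pytest, or with DeepEval’s own runner (deepeval test run), which adds reporting conveniences on top of it.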

The Living Eval Dataset

Your eval dataset can’t be static. It needs to evolve with your production experience:

  • Start with representative production scenarios and known edge cases
  • Add new cases when production failures occur
  • Include examples from user complaints and support escalations
  • Cover diverse user segments and use cases
  • Regularly prune outdated or duplicate cases

The pattern: production teaches you new failure modes, you convert those into eval cases, and those cases prevent regression. Your quality gate gets smarter over time.

In one project, our dataset grew from 20 cases at launch to 200+ within two months. Every production issue became an eval case. This turned our pre-deployment gate from “does it work in theory?” to “have we seen this break before?”
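The mechanics can stay simple. Here’s a sketch of that feedback loop, assuming the golden dataset lives in a JSON file; the schema and helper are hypothetical:

```python
# Hypothetical helper: convert a production failure into a permanent
# eval case so the same mistake can never ship again unnoticed.
import json
from datetime import date

DATASET_PATH = "golden_dataset.json"  # assumed location of the eval set

def add_eval_case(user_input: str, expected: str, source: str) -> None:
    with open(DATASET_PATH) as f:
        cases = json.load(f)
    cases.append({
        "input": user_input,
        "expected": expected,
        "source": source,                   # e.g. "support-ticket", "prod-trace"
        "added": date.today().isoformat(),  # helps with later pruning
    })
    with open(DATASET_PATH, "w") as f:
        json.dump(cases, f, indent=2)
```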

Layer 2: Deployment Strategies for AI

Traditional software (mostly) deploys atomically: old code out, new code in. AI systems need more nuance.

Three Deployment Techniques

Canary deployments: Start with 5-10% of traffic. Monitor AI quality metrics closely. If metrics hold, gradually increase. If quality degrades, roll back immediately. This caught multiple issues for us where pre-deployment evals looked fine, but production behaviour differed.
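A minimal sketch of the canary decision, assuming your monitoring can aggregate quality metrics per traffic group; the metric names and thresholds are illustrative:

```python
# Hedged sketch: decide the fate of a canary by comparing its quality
# metrics to the control group. Thresholds are illustrative.

def evaluate_canary(canary: dict, control: dict,
                    max_accuracy_drop: float = 0.05,
                    max_error_rate: float = 0.05) -> str:
    if canary["error_rate"] > max_error_rate:
        return "rollback"  # hard failure: revert immediately
    if canary["accuracy"] < control["accuracy"] - max_accuracy_drop:
        return "rollback"  # quality degraded relative to control
    return "promote"       # metrics hold: increase traffic share
```

You’d run a check like this at each traffic increment before expanding further.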

Feature flags: Decouple deployment from activation. Deploy the code, but control who sees the AI feature. Start with beta users, monitor quality, and expand gradually. When issues arise, flip the flag; no redeployment needed.
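A sketch of that pattern, with an in-memory dict standing in for a real flag service (LaunchDarkly, Unleash, or a config table); the flag name and both code paths are hypothetical:

```python
# Hedged sketch of feature-flag gating. In practice the flag config is
# fetched at runtime, so flipping it needs no redeployment.
import hashlib

FLAGS = {"ai_recommendations": {"enabled": True, "rollout_pct": 10}}

def _bucket(user_id: str) -> int:
    # Deterministic 0-99 bucket so a user gets a consistent experience.
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def ai_enabled(user_id: str, flag: str = "ai_recommendations") -> bool:
    cfg = FLAGS.get(flag, {})
    return cfg.get("enabled", False) and _bucket(user_id) < cfg.get("rollout_pct", 0)

def recommend(user_id: str, query: str) -> list:
    if ai_enabled(user_id):
        return ai_recommend(query)        # hypothetical AI path
    return rule_based_recommend(query)    # existing non-AI fallback
```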

A/B testing: Run new models alongside old ones. Compare both quality metrics and user satisfaction. Sometimes better eval scores don’t translate to better user outcomes. A/B testing surfaces these mismatches before you commit to the change.

Layer 3: Production Monitoring

Deployment is just the beginning. AI systems need monitoring beyond traditional observability.

What to Monitor

Model performance: Track accuracy, precision, recall, F1; whatever quality metrics matter for your use case. Not just averages; watch the distribution. A model with 85% average accuracy but wild variance is unreliable.

Data drift: Input distributions change. User behaviour shifts. New edge cases emerge. Detect when incoming data looks significantly different from training data. This often predicts performance degradation before users complain.
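For a single numeric feature, even a two-sample Kolmogorov-Smirnov test gets you started. A sketch using scipy, assuming you retain a sample of training-time inputs; real systems check many features, and the alpha value is illustrative:

```python
# Sketch of univariate drift detection with a two-sample
# Kolmogorov-Smirnov test. Real systems monitor many features and tune
# alpha carefully; 0.01 here is illustrative.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray,
                    live_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """True when live inputs look unlike the training distribution."""
    _, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # low p-value: distributions likely differ
```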

Prediction patterns: Monitor what your model is predicting. Confidence distributions, prediction diversity, and rate of fallbacks to defaults. Unusual patterns often signal problems.

Business impact: The ultimate metric. Does the AI improve conversion? Reduce support tickets? Increase engagement? If quality metrics look good but business impact declines, something’s wrong.

Layer 4: Continuous Production Evaluation

Pre-deployment evals validate quality before shipping. Production evals confirm the system keeps working after shipping.

Tracing and Monitoring with Opik

For LLM-based systems, Opik can provide comprehensive tracing and evaluation in production. It captures every LLM call with full context: prompts, responses, token usage, and latency. You can trace multi-step agentic workflows, see where things go wrong, and evaluate production traces against your quality criteria.

Opik integrates with major LLM providers and frameworks. You instrument your code once, and it automatically captures traces. Then you can run evals on production data, compare versions, and build dashboards showing quality trends over time.

The key insight: production data is your best eval dataset. Real user queries, real edge cases, real failure modes. Opik helps you systematically evaluate that data rather than just logging it.
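Instrumentation can be as light as a decorator. A minimal sketch, assuming Opik credentials are already configured (for example via `opik configure`), with call_llm as a placeholder for your actual provider call:

```python
# Minimal sketch of Opik tracing. Credentials are configured separately;
# call_llm stands in for your real LLM call.
from opik import track

@track  # records inputs, outputs, and timing as a trace
def answer_question(question: str) -> str:
    return call_llm(question)  # placeholder; Opik also offers provider integrations
```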

Human Review

Some quality aspects need human judgment. Sample strategically:

  • Low-confidence predictions
  • Cases with negative user feedback
  • Random samples for baseline comparison
  • Edge cases flagged by automated checks

In one project, our team reviewed anonymised interactions via Opik daily. This caught nuanced failures that automated evals missed and helped refine the eval criteria. Human review teaches your automated evals what to look for.
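A sketch of that sampling strategy, assuming each trace is a dict with hypothetical confidence, user_feedback, and trace_id fields (not any particular tracing schema):

```python
# Sketch of strategic sampling for human review. The trace fields
# (confidence, user_feedback, trace_id) are assumptions for illustration.
import random

def sample_for_review(traces: list, n_random: int = 20) -> list:
    low_conf = [t for t in traces if t.get("confidence", 1.0) < 0.6]
    flagged = [t for t in traces if t.get("user_feedback") == "negative"]
    baseline = random.sample(traces, min(n_random, len(traces)))
    seen, queue = set(), []
    for trace in low_conf + flagged + baseline:  # priority order
        if trace.get("trace_id") not in seen:
            seen.add(trace.get("trace_id"))
            queue.append(trace)
    return queue
```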

Layer 5: Automated Response

Monitoring without response is expensive logging. You need automated remediation.

Auto-Rollback

Define rollback conditions: accuracy drops by 10%, error rate spikes above 5%, or catastrophic failures exceed a set threshold. When conditions trigger, wait briefly to confirm the issue persists (avoid rolling back on transient blips), then revert automatically to the previous version.

Include a grace period; don’t roll back immediately after deployment. Give the system time to warm up. But once the grace period passes and degradation is confirmed, roll back without human intervention. Alert the team, but don’t wait for approval.
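A sketch of that logic, with fetch_metrics, rollback, and alert_team as hypothetical hooks into your monitoring and deployment systems; the thresholds and windows are illustrative:

```python
# Sketch of the auto-rollback check. fetch_metrics, rollback, and
# alert_team are hypothetical hooks; thresholds and windows are
# illustrative, not recommendations.
import time

GRACE_PERIOD_S = 15 * 60    # ignore a fresh deploy while it warms up
CONFIRM_WINDOW_S = 5 * 60   # re-check before acting on a transient blip

def degraded(metrics: dict, baseline_accuracy: float) -> bool:
    return (metrics["accuracy"] < baseline_accuracy - 0.10
            or metrics["error_rate"] > 0.05)

def watch_deployment(deployed_at: float, baseline_accuracy: float) -> None:
    if time.time() - deployed_at < GRACE_PERIOD_S:
        return  # still inside the grace period
    if degraded(fetch_metrics(), baseline_accuracy):
        time.sleep(CONFIRM_WINDOW_S)  # confirm the issue persists
        if degraded(fetch_metrics(), baseline_accuracy):
            rollback()  # revert without waiting for approval
            alert_team("Auto-rollback: sustained quality degradation")
```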

Graceful Degradation

Sometimes rollback isn’t the answer. When AI quality degrades but doesn’t fail catastrophically, reduce reliance gradually (a routing sketch follows this list):

  • Route a percentage of traffic to rule-based fallbacks or older versions of prompts.
  • Show AI results alongside traditional results or previous agent version results.
  • Disable AI for low-confidence scenarios only.
  • Set recovery criteria to automatically restore full AI once quality returns.
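Here’s the routing sketch; ai_answer (returning a text plus a confidence score), rule_based_answer, and all thresholds are hypothetical:

```python
# Sketch of confidence-gated graceful degradation. The flag, share, and
# threshold values are illustrative assumptions.
import random

DEGRADED_MODE = False  # flipped by monitoring when quality dips
FALLBACK_SHARE = 0.5   # traffic share sent to the fallback path

def handle(query: str) -> str:
    if DEGRADED_MODE and random.random() < FALLBACK_SHARE:
        return rule_based_answer(query)   # older, safer path
    answer, confidence = ai_answer(query)
    if confidence < 0.5:
        return rule_based_answer(query)   # skip AI on low confidence
    return answer
```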


The Complete Pipeline

Here’s how it all fits together:

On every commit:

  1. Run unit tests (deterministic code)
  2. Run integration tests (system behaviour)
  3. Run evals with DeepEval (AI quality)
  4. Compare eval results to baseline
  5. Block deployment if quality regressed (see the gate sketch after this list)
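Steps 4 and 5 can be a small script at the end of the CI job. A sketch, assuming eval results and the main-branch baseline are stored as JSON artefacts; the file names and the two-point tolerance are assumptions:

```python
# Sketch of the gate for steps 4-5: compare this branch's eval results
# to the stored main-branch baseline and fail the CI job on regression.
import json
import sys

with open("eval_results.json") as f:
    current = json.load(f)   # e.g. {"accuracy": 0.81}
with open("baseline_main.json") as f:
    baseline = json.load(f)

if current["accuracy"] < baseline["accuracy"] - 0.02:
    print(f"Quality regressed: {current['accuracy']:.2f} vs "
          f"baseline {baseline['accuracy']:.2f}")
    sys.exit(1)              # nonzero exit blocks deployment
print("Eval gate passed.")
```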

On deployment to staging:

  1. Deploy the new version
  2. Run evals against the staging environment
  3. Verify quality in production-like conditions

On deployment to production:

  1. Start canary at 5% traffic
  2. Monitor quality metrics for 30 minutes
  3. Compare canary metrics to control group
  4. If quality holds, gradually increase to 100%
  5. If quality degrades, auto-rollback

Continuously in production:

  1. Opik traces every LLM interaction
  2. Monitor quality metrics, data drift, prediction patterns
  3. Run evals on production traces
  4. Alert on degradation
  5. Auto-rollback or graceful degradation as needed
  6. Feed production failures back to eval dataset

The pipeline runs on every commit. Production monitoring runs continuously: hourly checks, real-time alerting, and ongoing evaluation.

The Human Element

Automation doesn’t eliminate humans. It redirects them to higher-value work, including:

AI Quality On-Call

This is separate from infrastructure or general incident on-call. Responsibilities include responding to eval alerts, investigating accuracy degradations, approving or rejecting anomalous canary deployments, reviewing the human-review queue, and making rollback decisions in ambiguous cases.

Weekly Quality Reviews

The team reviews trends: eval performance over time, new failure modes, false positive/negative rates, user feedback patterns, performance across user segments.

These reviews feed back into the system: update eval criteria, adjust quality thresholds, reprioritise model improvements, and refine the feedback loop.

Getting Started

Don’t build everything at once. Iterate.

Step 1: Add basic monitoring. Collect AI metrics—confidence distributions, latency, fallback rates, error rates. No alerts yet. Just observe.

Step 2: Pre-deployment evals. Create a small eval dataset (20-50 cases). Run it in CI as a non-blocking check. Watch results, but don’t fail builds. Get comfortable with the concept.

Step 3: Basic canary. Deploy to 5% traffic, wait 30 minutes, go to 100%. Manual monitoring. Simple.

Step 4: Production evals. Run the eval suite against production data daily. Manual review. Alert on severe degradation only.

Later: Add sophistication gradually. Expand eval datasets from production learnings. Add automated rollback for critical issues. Implement A/B testing. Set up drift detection. Refine thresholds based on real data.

Start small. Build confidence. Let the pipeline earn trust before it enforces decisions.

The Uncomfortable Truth

Let me be honest: building CI/CD/CE pipelines for AI is significantly more work than traditional CI/CD.

You’ll spend time building eval infrastructure. You’ll iterate on quality metrics. You’ll have false alerts. You’ll need to educate your team on new concepts. Your deployment times will be longer.

But here’s the alternative: shipping AI features into production and hoping they keep working; discovering quality degradation through user complaints; making changes blind because you don’t know current performance.

The companies succeeding with AI aren’t necessarily those with the best models. They’re those with the best pipelines; pipelines that ensure their AI keeps delivering value even as the world changes around it.

The Third Pillar

Traditional software had two pillars: CI and CD. Build it right, ship it reliably. Once deployed, monitoring was about uptime and errors.

AI systems need that third pillar: Continuous Evaluation. Because deployment doesn’t mean it works; it means you can now measure whether it works.

CI/CD/CE isn’t about perfection. It’s about confidence. Confidence that you’ll know when your AI stops working. Confidence that you can prevent bad deployments. Confidence that you can respond quickly when issues arise.

The fundamentals still matter. We’re just adding one more fundamental to the foundation.

This is the fourth in a series exploring engineering practices for AI-era development. We’ve covered engineering fundamentals, specification techniques, and testing strategies. Coming next: “Iterative AI: Learning to Fail Fast with Intelligence”—adapting agile methodologies for AI’s unique uncertainties.
