AI Testing Frameworks: DeepEval, Ragas, LangSmith, CI Integration

Introduction

Testing AI applications is fundamentally different from testing traditional software. LLM outputs are non-deterministic, there is often no single correct answer, and failures show up as subtle quality degradations rather than crashes. Dedicated AI testing frameworks address these challenges with automated evaluation metrics, test case generation, and regression detection.
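
Because outputs vary between runs, a single evaluation can pass or fail by chance. One common way to cope is to score the same input several times and assert on the aggregate. The sketch below illustrates that idea only: `mean_score` and `make_fake_scorer` are hypothetical names, and the fake scorer simulates a fluctuating metric instead of calling a real model or any of the frameworks discussed here.

```python
import random
import statistics
from typing import Callable


def mean_score(score_once: Callable[[], float], runs: int = 5) -> float:
    """Call a (non-deterministic) scorer several times and average the results."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return statistics.mean(score_once() for _ in range(runs))


def make_fake_scorer(seed: int) -> Callable[[], float]:
    """Hypothetical scorer: simulates a metric that fluctuates around 0.8."""
    rng = random.Random(seed)
    # Clamp the simulated score to the usual 0-1 metric range.
    return lambda: min(1.0, max(0.0, rng.gauss(0.8, 0.05)))


if __name__ == "__main__":
    average = mean_score(make_fake_scorer(seed=42), runs=10)
    print(f"mean score over 10 runs: {average:.2f}")
```

Averaging trades extra model calls for a more stable pass/fail signal, so the number of runs is a cost decision as much as a statistical one.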

DeepEval

DeepEval is an open-source testing framework designed for LLM applications:

from deepeval import assert_test
from deepeval.metrics import (
    HallucinationMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
    BiasMetric,
    ToxicityMetric,
)
from deepeval.test_case import LLMTestCase


def test_rag_response_no_hallucination():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris, located in the Ile-de-France region.",
        # HallucinationMetric checks the output against `context`
        context=["Paris is the capital and most populous city of France."],
    )
    # For this metric the threshold is the maximum acceptable score
    hallucination_metric = HallucinationMetric(threshold=0.3)
    assert_test(test_case, [hallucination_metric])


def test_response_relevancy():
    test_case = LLMTestCase(
        input="Explain Kubernetes pods",
        actual_output="Kubernetes is a container orchestration platform...",
        retrieval_context=[
            "A pod is the smallest deployable unit in Kubernetes.",
            "Pods can contain one or more containers.",
        ],
    )
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [relevancy_metric])


# Run tests: deepeval test run test_ai.py

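DeepEval supports 15+ evaluation metrics, including hallucination detection, answer relevancy, faithfulness, contextual precision, bias detection, and toxicity scoring. Each metric returns a score between 0 and 1 that is compared against a configurable threshold. For most metrics a higher score is better and the threshold is a minimum; for hallucination, bias, and toxicity the score measures the unwanted behavior, so the threshold is a maximum.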

Ragas

Ragas is specialized for evaluating RAG pipelines end-to-end:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
# (column names follow the older Ragas schema; newer releases may expect different names)
test_data = Dataset.from_dict({
    "question": [
        "What is a Kubernetes pod?",
        "How does load balancing work?",
    ],
    "answer": [
        "A pod is the smallest deployable unit...",
        "Load balancing distributes traffic...",
    ],
    "contexts": [
        ["Pods can contain one or more containers."],
        ["Load balancers distribute incoming traffic."],
    ],
    "ground_truth": [
        "A pod is the smallest deployable unit in Kubernetes.",
        "Load balancing distributes network traffic across servers.",
    ],
})

# Compute RAG metrics
result = evaluate(
    test_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# Example output:
# {
#     "faithfulness": 0.92,
#     "answer_relevancy": 0.88,
#     "context_precision": 0.95,
#     "context_recall": 0.85
# }

Ragas decomposes RAG quality into four independent metrics: faithfulness (is the answer grounded in context?), answer relevancy (does the answer address the question?), context precision (are retrieved documents relevant?), and context recall (are all relevant documents retrieved?).
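
To make the difference between context precision and context recall concrete, here is a deliberately simplified, set-based sketch. It assumes the relevant document IDs are already known, which is rarely true in practice; Ragas itself relies on LLM judgments over the text, so this is not how the library computes its scores. The function names and document IDs are hypothetical.

```python
from typing import Iterable, Set


def context_precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved documents that are relevant."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & relevant) / len(retrieved_set)


def context_recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


if __name__ == "__main__":
    retrieved_ids = ["doc1", "doc2", "doc3", "doc4"]
    relevant_ids = {"doc1", "doc3", "doc7"}
    # 2 of the 4 retrieved documents are relevant
    print(context_precision(retrieved_ids, relevant_ids))
    # 2 of the 3 relevant documents were retrieved
    print(context_recall(retrieved_ids, relevant_ids))
```

A retriever can score well on one and poorly on the other: returning only a single relevant document gives perfect precision but low recall, while returning the entire corpus does the opposite.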

LangSmith

LangSmith provides a hosted evaluation platform with tracing and annotation:

from difflib import SequenceMatcher

from langsmith import Client, evaluate
from langsmith.schemas import Example, Run

client = Client()


def compute_similarity(predicted: str, expected: str) -> float:
    # Placeholder scorer: plain string similarity in [0, 1].
    # Swap in an LLM-as-judge call here for semantic comparison.
    return SequenceMatcher(None, predicted, expected).ratio()


# Define a custom evaluator
def answer_correctness(run: Run, example: Example) -> dict:
    # Compare model output against expected output
    predicted = (run.outputs or {}).get("output", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "answer_correctness", "score": compute_similarity(predicted, expected)}


# Run evaluation on a dataset
# (my_llm_chain is the application under test and must be defined elsewhere)
results = evaluate(
    lambda inputs: my_llm_chain(inputs["question"]),
    data="my-test-dataset",
    evaluators=[answer_correctness],
    experiment_prefix="rag-v2-eval",
)

# View results in LangSmith dashboard
print(results)

LangSmith excels at trace-level evaluation. Every LLM call, retrieval step, and prompt template is recorded and can be compared across experiments.
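
Comparing experiments is most useful at the level of individual examples, since an unchanged average can hide a few sharp drops. The sketch below shows the idea in plain Python, assuming per-example scores from two runs have already been exported into dictionaries keyed by example ID. The function name, the input shape, and the tolerance value are illustrative assumptions, not part of the LangSmith SDK.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[Tuple[str, float, float]]:
    """Return (example_id, baseline_score, candidate_score) for every example
    whose score dropped by more than `tolerance`, largest drop first."""
    regressions = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        if new_score is None:
            continue  # example missing from the candidate run; skip it
        if old_score - new_score > tolerance:
            regressions.append((example_id, old_score, new_score))
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


if __name__ == "__main__":
    baseline_scores = {"q1": 0.90, "q2": 0.80, "q3": 0.70}
    candidate_scores = {"q1": 0.88, "q2": 0.50, "q3": 0.60}
    for example_id, old, new in find_regressions(baseline_scores, candidate_scores):
        print(f"{example_id}: {old:.2f} -> {new:.2f}")
```

The tolerance absorbs ordinary run-to-run noise, so only drops larger than that margin are reported.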

CI Integration

Integrate AI tests into your CI pipeline to catch regressions automatically:

# .github/workflows/ai-tests.yml
name: AI Evaluation Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install deepeval ragas langsmith

      - name: Run AI evaluation tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          deepeval test run tests/ai_evaluation.py

      - name: Check quality gate
        # Verify this subcommand against your installed DeepEval CLI before relying on it
        run: |
          deepeval metrics --min-threshold 0.7

Set a quality gate that blocks merges when metrics fall below thresholds. This prevents gradual quality degradation that is invisible in traditional tests.
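
A quality gate can also be a small script that the pipeline runs after evaluation. The following is a minimal sketch, assuming the evaluation step has produced a dictionary of aggregate scores; `gate_failures`, `run_gate`, and the metric names are hypothetical and not part of DeepEval, Ragas, or LangSmith.

```python
import sys
from typing import Dict, List


def gate_failures(scores: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Describe every metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in minimums.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: no score reported")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return failures


def run_gate(scores: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Print failures to stderr and return a process exit code (0 = pass)."""
    failures = gate_failures(scores, minimums)
    for failure in failures:
        print(f"QUALITY GATE FAILED  {failure}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI step, calling `sys.exit(run_gate(scores, minimums))` turns a failed gate into a failed job, which is what lets branch protection block the merge. Treating a missing score as a failure keeps a silently skipped metric from passing the gate.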

Conclusion

AI testing requires specialized frameworks that understand non-deterministic outputs. Use DeepEval for unit-test-style evaluation with configurable metrics. Use Ragas for end-to-end RAG pipeline evaluation. Use LangSmith for trace-level debugging and comparison across experiments. Integrate all three into CI pipelines with automated quality gates to maintain production AI quality over time.
