RAG Suites

Specialized evaluation for Retrieval-Augmented Generation systems. These suites test whether AI responses are faithful to source documents and properly grounded.

rag-faithfulness

Tests whether generated responses are faithful to the retrieved context.

What it Measures

  • Factual consistency - Claims match source documents
  • No hallucination - No invented information
  • Attribution accuracy - Facts traced to correct sources
  • Completeness - Key information from context included
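The checks above are performed by the suite's evaluator, but the core idea of factual consistency can be illustrated with a minimal local sketch. The helper below (hypothetical, not gateflow's implementation) flags response sentences whose words barely overlap with the retrieved context — a crude lexical proxy for unsupported claims:

```python
def unsupported_sentences(context, response, min_overlap=0.5):
    """Flag response sentences sharing too few words with the context.

    A rough lexical heuristic: real faithfulness evaluators use
    model-based claim verification, not word overlap.
    """
    context_words = set(" ".join(context).lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged
```

For example, given the context `["The company was founded in 2015 by Jane Doe."]`, a response adding "Profits tripled overnight." would have that sentence flagged as unsupported.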

How it Works

Each case includes:

  1. Retrieved context (documents)
  2. User query
  3. Model response to evaluate
python
case = {
    "context": [
        "Document 1: The company was founded in 2015 by Jane Doe...",
        "Document 2: Revenue reached $50M in 2024..."
    ],
    "query": "When was the company founded and by whom?",
    "response": "The company was founded in 2015 by Jane Doe.",
    "evaluator": "faithfulness_check"
}

Scoring

Score     Interpretation
100%      Fully faithful - all claims supported
80-99%    Minor unsupported details
50-79%    Some hallucinated content
<50%      Significant faithfulness issues
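If you need to bucket raw scores into these bands programmatically, a small mapping function is enough (the function name is illustrative, not part of the gateflow API):

```python
def interpret_faithfulness(score: float) -> str:
    """Map a 0-100 faithfulness score onto the interpretation bands above."""
    if score >= 100:
        return "Fully faithful"
    if score >= 80:
        return "Minor unsupported details"
    if score >= 50:
        return "Some hallucinated content"
    return "Significant faithfulness issues"
```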

Usage

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Evaluate RAG responses
results = client.run_suite(
    suite="rag-faithfulness",
    model="gpt-4o"
)

# Or evaluate your own RAG outputs
results = client.evaluate_rag(
    cases=[
        {
            "context": retrieved_docs,
            "query": user_query,
            "response": model_response
        }
    ],
    evaluator="faithfulness"
)

rag-groundedness

Evaluates whether each claim in the response is grounded in the provided context.

Claim-Level Analysis

python
results = client.run_suite(
    suite="rag-groundedness",
    model="gpt-4o"
)

# Get claim-by-claim breakdown
for case in results.cases:
    print(f"Response: {case.response[:100]}...")
    for claim in case.claims:
        print(f"  Claim: {claim.text}")
        print(f"  Grounded: {claim.is_grounded}")
        print(f"  Source: {claim.source_doc or 'NOT FOUND'}")

Grounding Categories

  • Fully grounded - Claim explicitly stated in context
  • Partially grounded - Reasonable inference from context
  • Ungrounded - No support in context (hallucination)
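The three categories can be approximated locally with a simple two-stage heuristic: an exact substring match counts as fully grounded, and lexical overlap stands in for "reasonable inference". This is a hypothetical sketch, not the suite's actual claim verifier:

```python
def grounding_category(claim, context, partial_threshold=0.6):
    """Classify a claim as fully grounded, partially grounded, or ungrounded.

    Exact-substring match counts as fully grounded; otherwise word overlap
    is used as a rough proxy for inferential support.
    """
    text = " ".join(context).lower()
    if claim.lower() in text:
        return "fully grounded"
    claim_words = set(claim.lower().split())
    context_words = set(text.split())
    overlap = len(claim_words & context_words) / len(claim_words)
    return "partially grounded" if overlap >= partial_threshold else "ungrounded"
```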

rag-citation

Tests citation accuracy and attribution.

What it Tests

  • Citation presence - Are claims properly cited?
  • Citation accuracy - Do citations point to correct sources?
  • Citation completeness - Are all claims that should be cited, cited?

Example

python
case = {
    "context": [
        {"id": "doc_1", "text": "Revenue was $50M in 2024."},
        {"id": "doc_2", "text": "Employee count reached 500."}
    ],
    "response": "Revenue reached $50M [doc_1] with 500 employees [doc_2].",
    "expected_citations": {
        "Revenue reached $50M": "doc_1",
        "500 employees": "doc_2"
    }
}

Scoring

Metric              Description
Citation precision  % of citations that point to the correct source
Citation recall     % of citation-worthy claims that are actually cited
Citation F1         Harmonic mean of precision and recall
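These metrics are standard set-based precision/recall, computed here over (claim, doc_id) pairs. A self-contained sketch of the arithmetic (not the gateflow scorer itself):

```python
def citation_scores(predicted, expected):
    """Compute citation precision, recall, and F1 over (claim, doc_id) pairs."""
    if not predicted or not expected:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    correct = len(predicted & expected)
    precision = correct / len(predicted)
    recall = correct / len(expected)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the example case above, `predicted = {("Revenue reached $50M", "doc_1"), ("500 employees", "doc_2")}` matches `expected_citations` exactly, so all three metrics would be 1.0.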

Evaluating Your RAG System

End-to-End Evaluation

python
from gateflow import EvalClient, RAGEvaluator

client = EvalClient(api_key="gf-...")

# Configure your RAG pipeline
evaluator = RAGEvaluator(
    retriever=my_retriever,  # Your retrieval function
    generator=my_generator,  # Your generation function
)

# Run comprehensive RAG eval
results = evaluator.evaluate(
    queries=[
        "What was the company's revenue in 2024?",
        "Who founded the company?",
        # ...
    ],
    suites=["rag-faithfulness", "rag-groundedness", "rag-citation"]
)

print(results.summary)
# Faithfulness: 94.2%
# Groundedness: 91.8%
# Citation F1: 88.5%

Component-Level Analysis

Identify where your RAG pipeline fails:

python
# Analyze failure modes
analysis = client.analyze_rag_failures(run_id="run_abc123")

print(analysis.breakdown)
# {
#   "retrieval_failures": 15%,    # Right answer not in context
#   "generation_failures": 8%,    # Answer in context but missed
#   "hallucination": 5%,          # Generated unsupported claims
#   "attribution_errors": 3%      # Wrong citation
# }

RAG-Specific Metrics

Context Relevance

Is the retrieved context actually relevant?

python
results = client.evaluate_retrieval(
    queries=queries,
    retrieved_docs=docs,
    ground_truth=expected_docs  # Optional gold standard
)

print(results.metrics)
# {
#   "precision@5": 0.82,
#   "recall@5": 0.75,
#   "mrr": 0.88,
#   "ndcg": 0.84
# }
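The retrieval metrics reported above are standard IR measures. For reference, precision@k and MRR can be computed as follows (an illustrative sketch, independent of the gateflow client):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant doc across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)
```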

Answer Correctness

Combined metric: Is the final answer correct?

python
results = client.evaluate(
    cases=cases,
    evaluator="rag_answer_correctness",
    config={
        "check_faithfulness": True,
        "check_completeness": True,
        "check_relevance": True
    }
)

Continuous RAG Monitoring

Set up production monitoring for your RAG system:

python
client.configure_sampling(
    rate=0.05,
    suites=["rag-faithfulness", "rag-groundedness"],
    metadata_capture=["retrieved_docs", "query_embedding"],
    alert_threshold=85
)
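Under the hood, `rate=0.05` means roughly 1 in 20 production requests is sampled for evaluation. The sampling decision itself is just a Bernoulli draw; a sketch of the idea (hypothetical helper, not part of the SDK):

```python
import random

def should_sample(rate=0.05, rng=None):
    """Decide whether a production request is sent for evaluation.

    With rate=0.05, about 5% of requests are sampled.
    """
    rng = rng or random
    return rng.random() < rate
```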
