RAG Suites

Specialized evaluation for Retrieval-Augmented Generation systems. These suites test whether AI responses are faithful to source documents and properly grounded.

rag-faithfulness

Tests whether generated responses are faithful to the retrieved context.

What it Measures

  • Factual consistency - Claims match source documents
  • No hallucination - No invented information
  • Attribution accuracy - Facts traced to correct sources
  • Completeness - Key information from context included
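The checks above are performed by the suite's evaluator, but the core idea of factual consistency can be illustrated with a minimal local sketch. The helper below (hypothetical, not gateflow's implementation) flags response sentences whose words barely overlap with the retrieved context — a crude lexical proxy for unsupported claims:

```python
def unsupported_sentences(context, response, min_overlap=0.5):
    """Flag response sentences sharing too few words with the context.

    A rough lexical heuristic: real faithfulness evaluators use
    model-based claim verification, not word overlap.
    """
    context_words = set(" ".join(context).lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged
```

For example, given the context `["The company was founded in 2015 by Jane Doe."]`, a response adding "Profits tripled overnight." would have that sentence flagged as unsupported.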

How it Works

Each case includes:

  1. Retrieved context (documents)
  2. User query
  3. Model response to evaluate
python
case = {
    "context": [
        "Document 1: The company was founded in 2015 by Jane Doe...",
        "Document 2: Revenue reached $50M in 2024..."
    ],
    "query": "When was the company founded and by whom?",
    "response": "The company was founded in 2015 by Jane Doe.",
    "evaluator": "faithfulness_check"
}

Scoring

Score     Interpretation
100%      Fully faithful - all claims supported
80-99%    Minor unsupported details
50-79%    Some hallucinated content
<50%      Significant faithfulness issues
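If you need to bucket raw scores into these bands programmatically, a small mapping function is enough (the function name is illustrative, not part of the gateflow API):

```python
def interpret_faithfulness(score: float) -> str:
    """Map a 0-100 faithfulness score onto the interpretation bands above."""
    if score >= 100:
        return "Fully faithful"
    if score >= 80:
        return "Minor unsupported details"
    if score >= 50:
        return "Some hallucinated content"
    return "Significant faithfulness issues"
```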

Usage

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Evaluate RAG responses
results = client.run_suite(
    suite="rag-faithfulness",
    model="gpt-4o"
)

# Or evaluate your own RAG outputs
results = client.evaluate_rag(
    cases=[
        {
            "context": retrieved_docs,
            "query": user_query,
            "response": model_response
        }
    ],
    evaluator="faithfulness"
)

rag-groundedness

Evaluates whether each claim in the response is grounded in the provided context.

Claim-Level Analysis

python
results = client.run_suite(
    suite="rag-groundedness",
    model="gpt-4o"
)

# Get claim-by-claim breakdown
for case in results.cases:
    print(f"Response: {case.response[:100]}...")
    for claim in case.claims:
        print(f"  Claim: {claim.text}")
        print(f"  Grounded: {claim.is_grounded}")
        print(f"  Source: {claim.source_doc or 'NOT FOUND'}")

Grounding Categories

  • Fully grounded - Claim explicitly stated in context
  • Partially grounded - Reasonable inference from context
  • Ungrounded - No support in context (hallucination)
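The three categories can be approximated locally with a simple two-stage heuristic: an exact substring match counts as fully grounded, and lexical overlap stands in for "reasonable inference". This is a hypothetical sketch, not the suite's actual claim verifier:

```python
def grounding_category(claim, context, partial_threshold=0.6):
    """Classify a claim as fully grounded, partially grounded, or ungrounded.

    Exact-substring match counts as fully grounded; otherwise word overlap
    is used as a rough proxy for inferential support.
    """
    text = " ".join(context).lower()
    if claim.lower() in text:
        return "fully grounded"
    claim_words = set(claim.lower().split())
    context_words = set(text.split())
    overlap = len(claim_words & context_words) / len(claim_words)
    return "partially grounded" if overlap >= partial_threshold else "ungrounded"
```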

rag-citation

Tests citation accuracy and attribution.

What it Tests

  • Citation presence - Are claims properly cited?
  • Citation accuracy - Do citations point to correct sources?
  • Citation completeness - Are all claims that should be cited, cited?

Example

python
case = {
    "context": [
        {"id": "doc_1", "text": "Revenue was $50M in 2024."},
        {"id": "doc_2", "text": "Employee count reached 500."}
    ],
    "response": "Revenue reached $50M [doc_1] with 500 employees [doc_2].",
    "expected_citations": {
        "Revenue reached $50M": "doc_1",
        "500 employees": "doc_2"
    }
}

Scoring

Metric              Description
Citation precision  % of citations that point to the correct source
Citation recall     % of citation-worthy claims that are actually cited
Citation F1         Harmonic mean of precision and recall
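These metrics are standard set-based precision/recall, computed here over (claim, doc_id) pairs. A self-contained sketch of the arithmetic (not the gateflow scorer itself):

```python
def citation_scores(predicted, expected):
    """Compute citation precision, recall, and F1 over (claim, doc_id) pairs."""
    if not predicted or not expected:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    correct = len(predicted & expected)
    precision = correct / len(predicted)
    recall = correct / len(expected)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the example case above, `predicted = {("Revenue reached $50M", "doc_1"), ("500 employees", "doc_2")}` matches `expected_citations` exactly, so all three metrics would be 1.0.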

Evaluating Your RAG System

End-to-End Evaluation

python
from gateflow import EvalClient, RAGEvaluator

client = EvalClient(api_key="gf-...")

# Configure your RAG pipeline
evaluator = RAGEvaluator(
    retriever=my_retriever,  # Your retrieval function
    generator=my_generator,  # Your generation function
)

# Run comprehensive RAG eval
results = evaluator.evaluate(
    queries=[
        "What was the company's revenue in 2024?",
        "Who founded the company?",
        # ...
    ],
    suites=["rag-faithfulness", "rag-groundedness", "rag-citation"]
)

print(results.summary)
# Faithfulness: 94.2%
# Groundedness: 91.8%
# Citation F1: 88.5%

Component-Level Analysis

Identify where your RAG pipeline fails:

python
# Analyze failure modes
analysis = client.analyze_rag_failures(run_id="run_abc123")

print(analysis.breakdown)
# {
#   "retrieval_failures": 15%,    # Right answer not in context
#   "generation_failures": 8%,    # Answer in context but missed
#   "hallucination": 5%,          # Generated unsupported claims
#   "attribution_errors": 3%      # Wrong citation
# }

RAG-Specific Metrics

Context Relevance

Is the retrieved context actually relevant?

python
results = client.evaluate_retrieval(
    queries=queries,
    retrieved_docs=docs,
    ground_truth=expected_docs  # Optional gold standard
)

print(results.metrics)
# {
#   "precision@5": 0.82,
#   "recall@5": 0.75,
#   "mrr": 0.88,
#   "ndcg": 0.84
# }
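The retrieval metrics reported above are standard IR measures. For reference, precision@k and MRR can be computed as follows (an illustrative sketch, independent of the gateflow client):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant doc across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)
```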

Answer Correctness

Combined metric: Is the final answer correct?

python
results = client.evaluate(
    cases=cases,
    evaluator="rag_answer_correctness",
    config={
        "check_faithfulness": True,
        "check_completeness": True,
        "check_relevance": True
    }
)

Continuous RAG Monitoring

Set up production monitoring for your RAG system:

python
client.configure_sampling(
    rate=0.05,
    suites=["rag-faithfulness", "rag-groundedness"],
    metadata_capture=["retrieved_docs", "query_embedding"],
    alert_threshold=85
)
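Under the hood, `rate=0.05` means roughly 1 in 20 production requests is sampled for evaluation. The sampling decision itself is just a Bernoulli draw; a sketch of the idea (hypothetical helper, not part of the SDK):

```python
import random

def should_sample(rate=0.05, rng=None):
    """Decide whether a production request is sent for evaluation.

    With rate=0.05, about 5% of requests are sampled.
    """
    rng = rng or random
    return rng.random() < rate
```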
