# RAG Suites
Specialized evaluation for Retrieval-Augmented Generation systems. These suites test whether AI responses are faithful to source documents and properly grounded.
## rag-faithfulness
Tests whether generated responses are faithful to the retrieved context.
### What it Measures
- Factual consistency - Claims match source documents
- No hallucination - No invented information
- Attribution accuracy - Facts traced to correct sources
- Completeness - Key information from context included
### How it Works
Each case includes:
- Retrieved context (documents)
- User query
- Model response to evaluate
```python
case = {
    "context": [
        "Document 1: The company was founded in 2015 by Jane Doe...",
        "Document 2: Revenue reached $50M in 2024..."
    ],
    "query": "When was the company founded and by whom?",
    "response": "The company was founded in 2015 by Jane Doe.",
    "evaluator": "faithfulness_check"
}
```

### Scoring
| Score | Interpretation |
|---|---|
| 100% | Fully faithful - all claims supported |
| 80-99% | Minor unsupported details |
| 50-79% | Some hallucinated content |
| <50% | Significant faithfulness issues |
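The bands above amount to simple arithmetic: the fraction of response claims supported by the context, mapped to an interpretation. A minimal sketch of that mapping (the `score_faithfulness` helper and its claim format are illustrative, not part of the gateflow API):

```python
def score_faithfulness(claims: list[dict]) -> tuple[float, str]:
    """Score = supported claims / total claims, mapped to a band."""
    supported = sum(1 for c in claims if c["supported"])
    score = 100 * supported / len(claims)
    if score == 100:
        band = "fully faithful"
    elif score >= 80:
        band = "minor unsupported details"
    elif score >= 50:
        band = "some hallucinated content"
    else:
        band = "significant faithfulness issues"
    return score, band

# Example: 4 of 5 claims supported -> 80%, "minor unsupported details"
claims = [{"text": f"claim {i}", "supported": i < 4} for i in range(5)]
```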
### Usage

```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Evaluate RAG responses
results = client.run_suite(
    suite="rag-faithfulness",
    model="gpt-4o"
)

# Or evaluate your own RAG outputs
results = client.evaluate_rag(
    cases=[
        {
            "context": retrieved_docs,
            "query": user_query,
            "response": model_response
        }
    ],
    evaluator="faithfulness"
)
```

## rag-groundedness
Evaluates whether each claim in the response is grounded in the provided context.
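Conceptually, the evaluator first decomposes the response into atomic claims, then checks each claim against the retrieved documents. As a toy illustration of the decomposition step only, here is naive sentence-level splitting (the actual claim extraction used by the suite is more sophisticated than this):

```python
import re

def split_into_claims(response: str) -> list[str]:
    """Naively treat each sentence as one atomic claim."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

response = "The company was founded in 2015. Revenue reached $50M in 2024."
claims = split_into_claims(response)
# Each claim is then checked independently against the context
```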
### Claim-Level Analysis

```python
results = client.run_suite(
    suite="rag-groundedness",
    model="gpt-4o"
)

# Get claim-by-claim breakdown
for case in results.cases:
    print(f"Response: {case.response[:100]}...")
    for claim in case.claims:
        print(f"  Claim: {claim.text}")
        print(f"  Grounded: {claim.is_grounded}")
        print(f"  Source: {claim.source_doc or 'NOT FOUND'}")
```

### Grounding Categories
- Fully grounded - Claim explicitly stated in context
- Partially grounded - Reasonable inference from context
- Ungrounded - No support in context (hallucination)
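One way to build intuition for these categories: measure how much of a claim's content is covered by the context. This toy classifier uses token overlap as a crude stand-in for the LLM-based judgment the suite actually applies; `classify_grounding` and its thresholds are illustrative only:

```python
def classify_grounding(claim: str, context: list[str]) -> str:
    """Token-overlap heuristic: a stand-in for an LLM grounding judge."""
    claim_tokens = set(claim.lower().split())
    best = max(
        len(claim_tokens & set(doc.lower().split())) / len(claim_tokens)
        for doc in context
    )
    if best >= 0.9:
        return "fully grounded"
    if best >= 0.5:
        return "partially grounded"
    return "ungrounded"

context = ["The company was founded in 2015 by Jane Doe."]
```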
## rag-citation
Tests citation accuracy and attribution.
### What it Tests
- Citation presence - Are claims properly cited?
- Citation accuracy - Do citations point to correct sources?
- Citation completeness - Is every claim that needs a citation actually cited?
### Example

```python
case = {
    "context": [
        {"id": "doc_1", "text": "Revenue was $50M in 2024."},
        {"id": "doc_2", "text": "Employee count reached 500."}
    ],
    "response": "Revenue reached $50M [doc_1] with 500 employees [doc_2].",
    "expected_citations": {
        "Revenue reached $50M": "doc_1",
        "500 employees": "doc_2"
    }
}
```

### Scoring
| Metric | Description |
|---|---|
| Citation precision | % of citations that are correct |
| Citation recall | % of claims that should be cited and are |
| Citation F1 | Harmonic mean of precision and recall |
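These are the standard precision/recall definitions applied to (claim, citation) pairs. A sketch of the arithmetic, assuming predicted and expected citations are both claim-to-doc-id mappings (the `citation_scores` helper is illustrative, not a gateflow API):

```python
def citation_scores(predicted: dict, expected: dict) -> dict:
    """Precision/recall/F1 over claim -> doc_id citation pairs."""
    correct = sum(1 for claim, doc in predicted.items()
                  if expected.get(claim) == doc)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

expected = {"Revenue reached $50M": "doc_1", "500 employees": "doc_2"}
predicted = {"Revenue reached $50M": "doc_1"}  # one claim left uncited
# Precision 1.0 (the one citation is correct), recall 0.5 (one claim missed)
```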
## Evaluating Your RAG System
### End-to-End Evaluation
```python
from gateflow import EvalClient, RAGEvaluator

client = EvalClient(api_key="gf-...")

# Configure your RAG pipeline
evaluator = RAGEvaluator(
    retriever=my_retriever,  # Your retrieval function
    generator=my_generator,  # Your generation function
)

# Run comprehensive RAG eval
results = evaluator.evaluate(
    queries=[
        "What was the company's revenue in 2024?",
        "Who founded the company?",
        # ...
    ],
    suites=["rag-faithfulness", "rag-groundedness", "rag-citation"]
)

print(results.summary)
# Faithfulness: 94.2%
# Groundedness: 91.8%
# Citation F1: 88.5%
```

### Component-Level Analysis
Identify where your RAG pipeline fails:
```python
# Analyze failure modes
analysis = client.analyze_rag_failures(run_id="run_abc123")

print(analysis.breakdown)
# {
#   "retrieval_failures": 15%,   # Right answer not in context
#   "generation_failures": 8%,   # Answer in context but missed
#   "hallucination": 5%,         # Generated unsupported claims
#   "attribution_errors": 3%     # Wrong citation
# }
```

## RAG-Specific Metrics
### Context Relevance
Is the retrieved context actually relevant?
```python
results = client.evaluate_retrieval(
    queries=queries,
    retrieved_docs=docs,
    ground_truth=expected_docs  # Optional gold standard
)

print(results.metrics)
# {
#   "precision@5": 0.82,
#   "recall@5": 0.75,
#   "mrr": 0.88,
#   "ndcg": 0.84
# }
```

### Answer Correctness
Combined metric: Is the final answer correct?
```python
results = client.evaluate(
    cases=cases,
    evaluator="rag_answer_correctness",
    config={
        "check_faithfulness": True,
        "check_completeness": True,
        "check_relevance": True
    }
)
```

## Continuous RAG Monitoring
Set up production monitoring for your RAG system:
```python
client.configure_sampling(
    rate=0.05,  # Sample 5% of production traffic for evaluation
    suites=["rag-faithfulness", "rag-groundedness"],
    metadata_capture=["retrieved_docs", "query_embedding"],
    alert_threshold=85  # Alert when a suite score falls below 85%
)
```

## Next Steps
- Drift Detection - Monitor RAG quality
- Routing Feedback - Route based on RAG scores
- LLM-as-Judge - Configure RAG evaluators